If you work in a museum and you've spent the last twelve months getting cold emails about "AI-powered audio tours," this guide is for you. It is the piece I wish I could have handed every curator and visitor-experience director I've talked to over the past year — before the vendor calls, before the demos, before the procurement spreadsheet.
I'm Eric Duffy. I run Convo, a platform in this category, so I have an opinion. I have also tried to write the guide I'd want to read if I were buying, not selling, including a section on where this category isn't the right answer. If you want a primer that aims to be vendor-neutral, with the math, the trade-offs, and a clear vocabulary, keep reading.
What is an AI audio guide?
An AI audio guide is an audio interpretation layer for a museum or cultural site where the drafting, voicing, translation, and updating of the audio is produced by software working from materials a curator provides, rather than by a studio production pipeline. Visitors typically reach it by scanning a QR code on a wall card or label, which opens a web player on their own phone. The category sits inside the broader "audio guide" category that museums have had a budget line for since the 1950s — what's new is the production tooling and, in some platforms, the ability for visitors to ask follow-up questions instead of only listening.
A useful way to picture the layers:
- The source. A curator's catalog notes, wall text, exhibition essays, label copy, a CSV from the collections management system — whatever the museum already has written.
- The draft. Software produces a first-pass script grounded in those sources. The curator reviews, edits, and approves before anything reaches a visitor.
- The voice. Neural text-to-speech converts the approved script to audio. The latest generation of these voices is increasingly hard to distinguish from a studio recording — in the Interspeech 2025 State of TTS evaluation, the top commercial TTS model achieved a 71.49% "human fooling rate" in blind listening tests, slightly above the human reference baseline.
- The language layer. The same approved script is regenerated and revoiced in additional languages — typically eight to ten — without re-booking studios or contracting new talent.
- The delivery. Web playback on the visitor's phone via QR code. Some platforms also offer chat or voice Q&A at each stop, grounded in the same source materials.
The most important phrase in that list is "grounded in those sources." A well-designed AI audio guide doesn't generate plausible-sounding content about your collection from the open internet. It is constrained to what your curators have already written. Vendors that can't show you how that grounding works are worth treating with caution.
How does an AI audio guide actually work?
The shape of the production pipeline is consistent across serious platforms, though vendors differ on every step. From the curator's seat it usually goes: upload reference materials, generate a draft, edit the draft, approve, voice, regenerate in other languages, publish. From the visitor's seat: scan, listen, optionally ask a question.
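The curator-side sequence can be pictured as a small state machine in which nothing is voiced or published without approval. The sketch below is illustrative only: the `Stop` class, its stage names, and the sample curator address are hypothetical and do not describe any vendor's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    DRAFTED = "drafted"
    APPROVED = "approved"
    VOICED = "voiced"
    PUBLISHED = "published"


@dataclass
class Stop:
    """One tour stop moving through the curator workflow (hypothetical model)."""
    title: str
    script: str
    stage: Stage = Stage.DRAFTED
    approved_by: Optional[str] = None

    def edit(self, new_script: str) -> None:
        # Any edit invalidates the earlier approval and audio, so the stop
        # returns to the draft stage and must be reviewed again.
        self.script = new_script
        self.stage = Stage.DRAFTED
        self.approved_by = None

    def approve(self, curator: str) -> None:
        # Record who signed off on the current script.
        self.approved_by = curator
        self.stage = Stage.APPROVED

    def voice(self) -> None:
        # Text-to-speech is only allowed on an approved script.
        if self.stage is not Stage.APPROVED:
            raise PermissionError("script must be approved before voicing")
        self.stage = Stage.VOICED

    def publish(self) -> None:
        # Visitors only ever receive stops that were approved and then voiced.
        if self.stage is not Stage.VOICED:
            raise PermissionError("only approved, voiced stops can be published")
        self.stage = Stage.PUBLISHED


stop = Stop("Gallery 3", "First-pass script drafted from the wall text.")
try:
    stop.voice()  # rejected: no curator has approved the draft yet
except PermissionError as err:
    print("blocked:", err)

stop.approve("curator@example.org")
stop.voice()
stop.publish()
print(stop.stage.value)
```

The design choice worth noticing is that `edit` resets the stage: a corrected line goes back through review rather than inheriting the old sign-off.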
Under the hood, the script-drafting step is usually a large language model conditioned on the curator's uploaded materials, often using retrieval-augmented generation so the model is answering from those documents rather than from general training data. The 2025 literature on RAG is honest about what this technique does and doesn't fix: grounding meaningfully reduces hallucination, but as the HalluLens benchmark and Stanford's 2025 legal-RAG reliability work both show, it does not eliminate it. This is why the curator-as-editor step is non-negotiable — every line a visitor hears should have been approved by a human who knows the material.
The voicing step is neural TTS. The translation step is usually a large language model again, often with a translation-tuned variant or a separate provider per language. The visitor Q&A step (where platforms offer it) is another grounded LLM call, this time at runtime, constrained to the same source set and ideally citing back to the document it pulled the answer from.
The two architectural choices that most affect the quality of what visitors actually experience:
- Where grounding lives. Does the platform constrain answers to the museum's own source files, or does it fall back to the open web when uncertain? The honest platforms decline to answer when they can't ground a response. The careless ones invent.
- Whether updates are live. When a curator fixes a line, does the change ship to visitors in seconds, or does it queue for a rebuild? "Same-day correction" is one of the headline arguments for the category over studio production — if a platform can't deliver it, the math gets less convincing.
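The first of those choices, declining when an answer can't be grounded, can be shown with a toy example. This sketch swaps in plain word overlap for the retrieval step; real platforms would more likely use embedding-based retrieval and a language model, and nothing here reflects a specific vendor's implementation. The function names, the `min_overlap` threshold, and the sample source texts are all made up for illustration.

```python
import re

# Common function words ignored when comparing a question to a source passage.
STOPWORDS = {
    "a", "an", "and", "at", "by", "did", "for", "how", "in", "is", "it",
    "of", "on", "the", "this", "to", "was", "what", "when", "where", "who",
}


def tokens(text):
    """Return the set of lowercase content words in the text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def answer(question, sources, min_overlap=2):
    """Return (passage, source_name), or (None, None) to signal a refusal.

    `sources` maps a file name to the curator-supplied text it contains.
    """
    q = tokens(question)
    best_name, best_score = None, 0
    for name, passage in sources.items():
        score = len(q & tokens(passage))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_overlap:
        return None, None  # not enough support in the sources: decline
    return sources[best_name], best_name


# Invented sample materials standing in for a museum's uploaded files.
sources = {
    "label_014.txt": "Harbor at Dusk was painted in 1887 and entered the collection in 1952.",
    "essay_intro.txt": "The exhibition traces coastal painting across three decades.",
}

for question in ("When was Harbor at Dusk painted?",
                 "How much is the painting worth today?"):
    passage, source = answer(question, sources)
    if passage is None:
        print(question, "->", "I can't answer that from the museum's materials.")
    else:
        print(question, "->", passage, f"[source: {source}]")
```

The second question shares only one content word with any source, so the sketch declines instead of guessing. That is the behavior to look for in a demo, whatever the underlying retrieval method is.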
This piece is the hub for our AI audio guides pillar. For a side-by-side on the production math versus the studio model, see the spoke on AI audio guides vs traditional audio guides.
Who is an AI audio guide actually for?
The honest answer: museums that want to give more visitors more interpretation, in more languages, than their current production budget allows. That's most institutions. The AAM's 2025 Annual Survey of Museum-Goers — fielded across 202 participating museums and 98,904 respondents — keeps showing the same pattern: visitors arrive expecting digital interpretation, and they notice when it isn't there.
The specific institutions where I've seen the category land most cleanly:
- Mid-size art and history museums that have a strong permanent collection, rotating exhibitions, and a small interpretation team. Studio production is too slow for the rotating side; AI tools let the same team cover both.
- Multi-site institutions — historic-house networks, regional museum systems, university museums with branch sites — where the per-site production cost of traditional audio is prohibitive but the visitor expectation is the same at every location.
- Tour operators and visitor centers at heritage sites, national parks, and walking-tour companies, where multilingual reach is the difference between serving the visitor and not.
- Museums in cities with significant non-English-speaking audiences. Per the US Census Bureau's American Community Survey, roughly 22% of the US population age five and older speaks a language other than English at home. In major tourist destinations the share of international visitors compounds that further.
- Institutions with temporary or traveling exhibitions where the audio needs to ship on opening day and disappear when the show closes.
The pattern in every case is the same: the curatorial team has more they want to say than the studio production pipeline will let them ship.
When does an AI audio guide not fit?
This is the section most vendor sites won't write. I'm writing it because the alternative — pretending the category fits every museum — is the kind of overclaim that erodes trust in the whole space.
Cases where I'd tell a curator to stay with what they have, or to use a different category of tool entirely:
- The audio guide is the artwork. If you've commissioned a specific artist or named voice to narrate the tour as part of the work — Janet Cardiff, an oral-history project, a community-voiced tour — that voice is the point. Generating a script from materials misses what the project is for. Use studio production, keep the voice.
- Very small, never-changing permanent collections. If you have twelve objects on permanent display and the wall text was written a decade ago and will be the same a decade from now, the production math for a one-time studio recording works fine. The case for AI is strongest where the collection moves.
- Compliance environments that prohibit generative tooling. Some institutional clients — usually government-affiliated or grant-restricted — have explicit rules against using generative AI in published material. Read your funder agreements. The category is moving fast and policies vary.
- Specialist scholarly tours where the writing is the contribution. A curator-authored, footnoted, peer-reviewed audio essay on a single work is a different artifact than an interpretation layer. AI drafting will not produce it, and it shouldn't try.
- Audiences who explicitly opt out of digital. Visitors who come to the museum precisely to disconnect — and they exist — still deserve wall text and a docent. The phone-delivered layer is additive, not a replacement.
For everything in between — which is most institutions — the question isn't whether the category fits. It's which platform fits, and what to negotiate for.
How does AI compare to traditional audio guides?
The short version: AI changes the production math, the language math, and the update math. It does not change the curatorial judgment, the editorial standard, or the institutional voice — those remain human work.
The traditional studio audio guide model assumes a production cycle measured in months, a per-language cost line, and an effectively frozen artifact once the studio session closes. Six months has long been described inside the industry as the practical floor for a custom mobile guide on the studio-and-handset model, a figure that matches what curators we work with report from their own RFPs. Per-language translation runs twelve to thirty cents per word at standard ATA rates, before re-recording.
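To make the per-language line concrete, here is a rough worked example using the twelve-to-thirty-cent range above. The tour size (30 stops at 150 words each) and the choice of nine additional languages are assumptions picked for illustration, not figures from any real project, and the result covers translation only, not re-recording.

```python
def translation_cost_cents(words, cents_per_word, languages):
    """Translation-only cost in cents; re-recording is not included."""
    return words * cents_per_word * languages


STOPS = 30            # assumption: number of stops on the tour
WORDS_PER_STOP = 150  # assumption: script length per stop
words = STOPS * WORDS_PER_STOP

for languages in (1, 9):
    low = translation_cost_cents(words, 12, languages)   # 12 cents per word
    high = translation_cost_cents(words, 30, languages)  # 30 cents per word
    print(f"{languages} additional language(s): ${low / 100:,.0f} to ${high / 100:,.0f}")

# Prints:
# 1 additional language(s): $540 to $1,350
# 9 additional language(s): $4,860 to $12,150
```

Working in whole cents keeps the arithmetic exact. Swap in your own stop count and script length to see where your tour lands.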
What an AI platform replaces is the scheduling drag and the per-language cost line. What it does not replace is the judgment work, shown here next to what the software takes on:
| What stays human | What software does |
| --- | --- |
| Deciding what to highlight, what to skip, what to frame | Drafting a first pass from your sources |
| Voice and register choices for your institution | Voicing the approved script |
| Scholarly accuracy and ethical framing | Translating + revoicing across languages |
| Reviewing every line that ships | Re-rendering on update |
| Answering "what should this tour be" | Answering "how do we produce it" |
The longer comparison — with the year-one budget math, the voice-quality discussion, and the cases where studio production still wins — is the AI audio guide vs traditional audio guide spoke. It's the next read after this one.
What about hallucination?
The most common objection, and a fair one. If the platform is using a language model to draft, what stops it from inventing things about your collection?
Three things, in order of importance:
- Source-grounded generation. Drafts are produced from documents the curator uploads. The model is constrained to the supplied materials. Anything outside that boundary should be flagged or refused.
- Human approval before publish. This is the actual guarantee. The draft is a starting point; the curator edits and signs off before any visitor hears it. Same review obligation a museum has always had over its interpretive copy.
- Grounded visitor Q&A with refusal. On platforms that let visitors ask follow-up questions, the answer is generated at runtime from the same source set. When the platform can't ground an answer, the right behavior is to say so, not to confabulate. Test this in any demo.
The 2025 academic record on RAG and faithfulness — the faithfulness-in-RAG leaderboard work, HalluLens, the Stanford legal-RAG study — is unanimous that grounding reduces error rates substantially but does not zero them out. Which is exactly why the editorial step matters, and exactly why I wrote a separate piece on authenticity and AI in museum interpretation. A tool that drafts well still ships under a curator's name.
How many languages does a museum audio guide need?
Enough to serve the visitors actually in the building, plus the audiences the museum is trying to reach next. For most US institutions that's a Spanish track at minimum, given the 22% of the US population age five and older that speaks a language other than English at home per the US Census ACS, and often a meaningfully larger set in destination cities. For European institutions the floor is usually higher.
The interesting shift is that the per-language cost — historically the reason most museums shipped English-only — is now close to zero on AI platforms. Convo, as one example, ships ten languages from one approved English source on every paid tier. The cost equation has flipped: language coverage used to be a budget question; now it's almost entirely an editorial-review question. The constraint is having a reviewer for each language who can read the output before it ships, not having the studio time to record it.
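If review is the real constraint, the publishing rule is simple to state: a language ships only once someone who reads it has signed off. A minimal sketch, with hypothetical language codes and sign-off records:

```python
def publishable_languages(requested, signed_off):
    """Split requested languages into those cleared to ship and those awaiting review."""
    ready = [lang for lang in requested if lang in signed_off]
    waiting = [lang for lang in requested if lang not in signed_off]
    return ready, waiting


requested = ["en", "es", "fr", "zh"]
# Hypothetical sign-off log: language code -> who reviewed the track.
signed_off = {"en": "curatorial team", "es": "reviewer A"}

ready, waiting = publishable_languages(requested, signed_off)
print("ship now:", ready)
print("hold for review:", waiting)
```

The software can generate every track on day one; the list of languages actually in front of visitors grows only as reviewers become available.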
For a deeper treatment of how many languages a specific museum actually needs, see Pillar 3 — Multilingual interpretation — when that hub is published.
What does an AI audio guide cost?
Pricing in the category is moving toward SaaS subscription, replacing the per-tour studio bill that defined the legacy model. The shape that's settling in across newer platforms:
- Subscription tiers by institutional size or visitor volume, typically monthly.
- Unlimited tours, languages, and edits included on most paid tiers. The unit economics that justified per-tour pricing in the studio era don't apply when the marginal cost of another tour is software-only.
- Pilots or free starter tiers so the institution can ship one tour end-to-end before signing.
- Enterprise / custom pricing for white-label, native apps, or procurement-heavy environments.
For specifics on what Convo charges, our pricing page is the source of truth and I won't drill into vendor-specific numbers here. For a procurement-grade breakdown of what to expect across vendors — and what to put in an RFP — see Pillar 2 (Buying & cost) when that hub is published.
The honest framing for a director comparing budgets: a year of an AI audio guide subscription typically costs less than producing a single traditional studio tour in two languages. The decision isn't usually between AI subscription and studio production for the same scope — it's between AI subscription with broad multilingual coverage and studio production for one English tour. Different products at the same price point.
What's the visitor experience like?
The shape of the visitor side has narrowed across vendors. Nearly all serious platforms now deliver via QR code to the visitor's own phone, opening a web player rather than requiring an app download. The reason is in the data: years of cross-industry research on cultural-organisation apps, summarised by Frankly, Green + Webb and consistent with the rest of the industry literature, keep finding that the average museum app pulls visitor adoption in the low single-digit percentages. Rented handsets fare slightly better in some contexts but trend the same direction.
What a typical visitor does:
- Scans a QR code at the wall or label.
- Lands on a web player; selects language if prompted.
- Listens to the stop. Pauses, scrubs, listens again.
- On platforms that offer it, asks a follow-up question by typing or speaking.
- Walks to the next stop. Repeats.
What changes between platforms is what happens at the fourth step, the follow-up question. The broadcast-only platforms, where that step doesn't exist, are operationally similar to traditional audio tours, with the production economics rebuilt underneath. The conversational platforms (Convo is one) let visitors follow their own curiosity, with answers grounded in the same curator-supplied materials.
This is also where the visitor data starts to look interesting to directors. When visits are conversations rather than broadcasts, the museum can see — anonymously, in aggregate — what visitors are actually asking about. That's a different artifact than tour-completion analytics, and it shows up in the board slide a buyer carries upstairs.
For the broader shift in what visitors expect in 2026, see the note on the 2026 museum visitor.
How do you evaluate AI audio guide vendors?
Most of the vendor-selection drudgery is no different than buying any other piece of museum software. The category-specific questions worth pushing hard on:
- How is grounding implemented? Ask to see how the platform behaves when a visitor asks a question outside the supplied source set. The honest behavior is to decline; the dishonest behavior is to invent.
- What's the update latency? When you correct an attribution or fix a line, how long until visitors hear the new version? Anything more than a few minutes is fighting against one of the category's headline arguments.
- Who controls voice and tone? Can the institution tune the register? Can a curator override a generated line with their own prose? Can a section be marked "do not regenerate"?
- What's the language list, and how is each language reviewed? Ten languages is now table stakes on serious platforms. What's the workflow for getting a native speaker to review the Mandarin track before it ships?
- What happens to our reference materials? Are they used to train models? Are they shared with other institutions? Get the answer in writing.
- What's the migration story if we leave? Can we export the scripts, the audio files, the visitor analytics? Vendors that make this hard are betting on lock-in.
- Who's the named contact when something breaks? SaaS vendors that won't put a human on the phone aren't the right partners for an institution that ships to the public.
For the deeper procurement guide — pricing models, total cost of ownership, full RFP templates — see Pillar 2 (Buying & cost) when that hub is published.
What about docents, wall text, and the rest of interpretation?
Nothing in this category replaces a docent. Docents read the room, take questions in the moment with their full expertise, and model how to look at a work. The right comparison isn't AI audio guide versus docent; it's AI audio guide versus the large majority of visits that never get a docent.
Wall text is similar. Wall text is the floor every museum should clear; an audio layer is what visitors who want more can reach. The category doesn't compete with wall text; it stands on top of it.
The cases where I've watched institutions get the integration right share a pattern: the interpretation team treats the AI audio layer as one channel among several, with the same editorial standards as the others. The wall text, the docent tour, the audio guide, the digital signage, and the school program all reference the same underlying scholarship and the same institutional voice. The platform under the audio is invisible to the visitor. The substance is what they remember.
Frequently asked questions
Is an AI audio guide just a chatbot?
No. The primary artifact is a curator-approved audio tour. Some platforms add the ability for visitors to ask follow-up questions, but the tour itself is a produced, edited, voiced piece of interpretation — not a chat session. The chatbot framing is the wrong mental model and usually a sign of a vendor that doesn't understand the museum context.
Will visitors actually use it?
The data on QR-launch usage is encouraging where the experience is good and the wall card is well placed. The data on native app downloads, consistently in the low single-digit percentages across the published industry research (Frankly, Green + Webb), is what kills app-based guides. The web-via-QR pattern that most serious platforms now use clears that threshold by a wide margin. Adoption inside the gallery is mostly a wayfinding and signage question, not a technology question.
Does it work offline?
Most platforms cache audio for the duration of the visit once a tour is loaded, so a weak gallery signal doesn't break playback. Conversational features that require a live model call usually need connectivity. If your galleries have known dead zones, ask vendors specifically how their player handles that.
How long does it take to get a tour live?
On modern AI platforms, a typical tour goes live in weeks rather than months. The bottleneck is almost always the curator's review pass, not the production tooling. A small pilot (one gallery, one language) can ship in days. A full multilingual launch across a permanent collection is the longer case: it usually takes one to three months of curator review time, depending on how many objects are in scope.
What about accessibility?
The web-via-QR delivery pattern is, in most respects, better for accessibility than a rented handset: it works with the visitor's own assistive settings, runs through their existing screen reader, and inherits their preferred text size. The audio layer itself supports blind and low-vision visitors who can't read wall text. For a fuller treatment, see Pillar 4 (Accessibility & inclusion) when that hub is published.
Do we lose our institutional voice?
Only if you let the defaults ship. The drafts are seeded from your materials, in your register, and edited by your curators. The platform should be tunable to your voice, not impose a generic one. If a vendor's demo sounds like every other museum's demo, that's a vendor problem, not a category problem.
How do we disclose this to visitors?
The field is still settling on a standard. Most US institutions I've talked with are comfortable with a one-line credit on the tour page that names the curatorial team as authors and the platform as the production tool. The substance — that a curator authored, reviewed, and approved every line — is what makes the disclosure honest.
Continue reading
The next read in this pillar is AI audio guide vs traditional audio guide: what actually changes — a side-by-side on production cost, time to launch, voice quality, multilingual reach, content control, and the cases where the traditional model still wins.
If you want the broader argument for why visitor expectations have shifted under museums' feet, the essay on the 2026 museum visitor is the place to start. For the most common objection to the category — that machine-assisted interpretation threatens authenticity — there's a separate piece on authenticity and AI.
About the author
Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about how museums could afford to be more ambitious with interpretation, drawing on discovery conversations with curators, directors, and education leads at small and mid-size US museums. Reach him at eric@convo.app or on LinkedIn.