PILLAR 03 · MULTILINGUAL INTERPRETATION

Multilingual audio guides: a museum's guide to reaching every visitor.

A practical guide to how many languages a museum audio guide actually needs, what changes when AI rewrites the production math, and where the multilingual story is over-claimed. For curators and visitor-experience leads planning language coverage in 2026.

ERIC DUFFY·FOUNDER·15 MIN READ·UPDATED 2026-05-29

If you walk through any major US museum on a Saturday afternoon and listen, you will hear at least three or four languages on the gallery floor. The audio guide, in most cases, will be available in one — sometimes two, occasionally five at a destination institution that put in the budget. The gap between who is in the building and who the interpretation is written for is the largest, quietest failure in museum visitor experience today.

I'm Eric Duffy. I run Convo, a platform in this category. We ship ten languages on every paid tier and I have an opinion about why that floor matters. I have also tried to be honest in this piece about where the multilingual story gets oversold — including the difference between a translated track and a localized one, and the cases where machine translation is the wrong answer entirely. If you want a vendor-neutral primer on how to think about language coverage, read on.

How many languages does a museum audio guide actually need?

Enough to serve the visitors who are already in the building, plus the audiences the museum is trying to reach next. For most US institutions that floor is at least English plus Spanish, given that more than one in five people age five and older speaks a language other than English at home per the US Census Bureau's American Community Survey, with Spanish accounting for roughly 61% of the non-English share. In destination cities the floor rises quickly — AAM has cited the share of people speaking a language other than English at home as approaching half the population in New York and Los Angeles. European institutions, especially those near rail hubs, usually plan for a floor of six to eight.
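As a quick arithmetic check on those Census figures, the two shares can be multiplied to see what an English-plus-Spanish floor actually covers. This is a minimal sketch; the 22% value is the rounded figure quoted in the FAQ at the end of this piece, and the variable names are illustrative only.

```python
# Approximate ACS shares as quoted in this article (rounded figures).
non_english_share = 0.22       # age 5+ speaking a language other than English at home
spanish_of_non_english = 0.61  # Spanish as a share of that group

# Share of the whole age-5+ population speaking Spanish at home.
spanish_share = non_english_share * spanish_of_non_english
# Everyone else in the non-English group, spread across all other languages.
other_share = non_english_share - spanish_share

print(f"Spanish at home: {spanish_share:.1%}")
print(f"All other non-English languages: {other_share:.1%}")
```

On those rounded inputs, Spanish alone comes to roughly 13% of the age-five-and-older population, with roughly 9% spread across every other language. That split is why Spanish is the usual second language and why the third and fourth depend on local data.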

The interesting shift is that the right number is no longer mostly a budget question. On AI platforms, the per-language cost has collapsed close to zero. What's left is an editorial question: how many of those languages do you have a reviewer for?

Which languages should we prioritize?

The honest framing has three inputs. Start with the languages of the community in your catchment area — the ACS gives you a tract-level read, and most regional planning organizations publish a simpler summary. Layer on the languages of your inbound tourism — NYC, for example, took in roughly 12.9 million international visitors in 2024 with the UK, Canada, France, Brazil, Italy, China, and Spain each contributing hundreds of thousands, per the NYC mayor's office tourism summary. Finally, ask the museum itself: which communities is the strategic plan asking you to reach next?

For most US museums the answer ends up as English, Spanish, and a destination-specific shortlist of three to five more. For a Met-scale institution serving global tourism, ten languages becomes table stakes. The hard part isn't picking the languages. It's the review workflow: who, on staff or on contract, reads each language before it ships. The review workflow section below covers where most institutions stumble.

For a deeper essay on serving multiple visitor types from one collection, see the note on one collection, many audiences — language is one of several dimensions where the same curatorial corpus has to meet very different people.

Translation versus localization: what's the difference?

Translation moves the words across. Localization moves the meaning across. The first is a clerical task that machines do well. The second is editorial work that depends on cultural fluency, and machines do it less well than the marketing pitch suggests.

A literal Spanish translation of an English wall-text essay will read as a translation. A localized version reads as if it were written in Spanish in the first place: the idioms, the register, the references, and the rhythm of how art is described in the Spanish-speaking culture of the audience you're reaching. Translation services trade groups have written about this for decades. The point is sharpest in museum contexts, where the writing has been edited within an inch of its life in English. Every choice of word matters, and a lazy translation collapses what the curator built.

Best-in-class multilingual museums treat translation as the first pass and localization as the editorial pass on top. Having AI produce the first draft does not change this. It cuts the cost of that draft to roughly zero, which means the budget for the second pass, the native-speaker editor, should go up, not down.

How does AI change the multilingual production math?

Traditional multilingual audio production added a per-language line item that compounded across the project. The ATA cites translation rates of twelve to thirty cents per word for professional services before any voice talent or studio time. A ninety-stop tour in English, Spanish, Mandarin, and German typically meant four separate studio sessions, four sets of voice talent, and four rounds of synchronization. Six months has long been described inside the industry as the practical floor for a custom mobile guide on the studio-and-handset model. Most museums looked at the bill and shipped English-only.
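To make the per-language line item concrete, here is a rough sketch of the translation-only cost for the ninety-stop, four-language tour described above, using the twelve-to-thirty-cent range. The words-per-stop figure is an assumption chosen purely for illustration (it does not come from any source cited here), and the function name is hypothetical. Voice talent, studio time, and synchronization are deliberately left out.

```python
def translation_cost(stops, words_per_stop, target_languages, rate_per_word):
    """Translation-only cost in dollars; excludes voice talent and studio time."""
    return stops * words_per_stop * target_languages * rate_per_word

STOPS = 90
TARGET_LANGUAGES = 3     # Spanish, Mandarin, German (English is the source)
WORDS_PER_STOP = 200     # ASSUMPTION: illustrative script length, not a sourced figure
LOW_RATE, HIGH_RATE = 0.12, 0.30  # dollars per word, the range quoted above

low = translation_cost(STOPS, WORDS_PER_STOP, TARGET_LANGUAGES, LOW_RATE)
high = translation_cost(STOPS, WORDS_PER_STOP, TARGET_LANGUAGES, HIGH_RATE)
print(f"Translation alone: ${low:,.0f} to ${high:,.0f}")
```

Under that assumed script length, translation alone lands between about $6,500 and $16,200 before a single studio hour is booked. Change `WORDS_PER_STOP` to match your own scripts; the cost scales linearly with both word count and the number of languages.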

On modern AI platforms, the same approved English script is re-voiced in eight to ten languages without re-booking anything. The marginal cost of an additional language is software-only. Convo, as one example, ships ten languages — English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, and Arabic — on every paid tier, and a corrected script re-renders in all ten in roughly a minute. That number used to be a budget meeting.

What this lets a museum do is not "save money on translation." It's "spend the same number of dollars on review and localization, spread across languages that previously had no track at all." The decision moves from "can we afford a second language" to "which native-speaker reviewers do we want to hire."

Where multilingual AI falls short

This is the section I want any director reading this piece to take seriously. The category over-claims here, and the over-claim is the single fastest way to lose trust with curators who care about voice.

Cases where machine-assisted multilingual interpretation does not carry the day:

  • Poetry, lyrics, and oral history. When the source text is rhythmically or culturally crafted in the original language, translation by any means — human or machine — produces a different artifact. Use studio talent and a translator who is also a writer in the target language.
  • Deeply culturally specific narration. A track on Día de los Muertos for a Mexican-American audience in East LA, narrated in Spanish, isn't a translation of an English track. It's a piece of writing for that audience in that register. The AI first draft will be flat. The hand-written version will resonate.
  • Indigenous languages, especially low-resource ones. Most large language models and TTS systems are weak on indigenous languages of the Americas, the Pacific, and elsewhere. If you need a Lakota or Diné track, the right answer is community-led production, not platform translation.
  • Voices that are part of the work. If the artist or oral-history subject is the narrator, the voice is the artifact. Don't replace it.
  • Anything the curator can't read. A track that ships without a fluent reviewer is a liability, full stop. The AI made it cheap to produce; that doesn't mean it's cheap to be responsible for.

This isn't a rejection of the category. It's the editorial fence around what the category can credibly do. The best multilingual museums treat machine-assisted translation as the standard floor and reserve studio production for the specific tracks that need it.

For the broader argument on where machine assistance fits ethically inside museum interpretation, see authenticity and AI in museum interpretation.

What does a multilingual review workflow look like?

The single most useful operational shift when moving from one-language production to ten-language production is treating review as a discrete role rather than a step inside the curator's job. The curator approves the English. A separate native-speaker reviewer approves each non-English track before it ships.

In practice the workflow that works for most institutions:

  1. Curator approves the English master. This is the source of truth — every other language regenerates from it.
  2. Platform produces draft tracks in each target language from that approved master.
  3. Per-language reviewer (staff, contractor, or community partner) reads the transcript and listens to the audio in their language. They flag literal-sounding phrasing, cultural misses, mispronunciations of proper nouns, and register problems.
  4. Curator and reviewer co-edit the script. The platform re-renders in roughly a minute.
  5. Track ships. Any future English edit cascades to all languages and the reviewer re-checks only the affected stops.
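Step 5 is the part that keeps reviewer time manageable, and it can be sketched in a few lines. The idea below is to record, for each language and stop, a fingerprint of the English text the reviewer approved against, then flag only the stops whose English has changed since. This is an illustrative sketch, not a description of how any particular platform works; `Tour`, `fingerprint`, `approve`, and `needs_review` are hypothetical names, and the sample scripts are made up.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

def fingerprint(text: str) -> str:
    """Stable hash of a script, used to detect edits to the English master."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class Tour:
    # stop_id -> curator-approved English script (the source of truth)
    master: Dict[str, str] = field(default_factory=dict)
    # (language, stop_id) -> fingerprint of the English text the reviewer signed off on
    approvals: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def approve(self, language: str, stop_id: str) -> None:
        """Record that a reviewer approved this stop against the current English."""
        self.approvals[(language, stop_id)] = fingerprint(self.master[stop_id])

    def needs_review(self, language: str) -> List[str]:
        """Stops never reviewed in this language, or edited in English since review."""
        return [
            stop_id
            for stop_id, text in self.master.items()
            if self.approvals.get((language, stop_id)) != fingerprint(text)
        ]

# Made-up example: two stops, both reviewed in Spanish, then one English edit.
tour = Tour(master={"stop-1": "A portrait from 1889.", "stop-2": "A bronze cast."})
for stop in tour.master:
    tour.approve("es", stop)

tour.master["stop-2"] = "A bronze cast, made in 1902."  # curator edits one stop
pending = tour.needs_review("es")
print(pending)
```

After the edit, only the changed stop comes back for the Spanish reviewer, while a language with no approvals on record would get every stop. Keying approvals to the English fingerprint rather than a timestamp means a reverted edit does not trigger a needless re-review.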

The bottleneck is finding the reviewers, especially for languages where the museum doesn't already have staff. Common sources: heritage-language faculty at local universities, professional translation agencies with museum portfolios, community organizations and language schools in your service area, and — for major destination languages — your sister institutions abroad, who often know exactly the freelancer you need.

The reason this matters more than the production tooling: the platform can ship a draft of any language in seconds. The institution can only ship a language it can stand behind. The review workflow is what closes the gap.

Who is responsible for translation quality — the museum or the platform?

The museum. Always. The platform produces a draft. The institution publishes.

This is the same line as the editorial responsibility for the English text, and the same line museums have held for catalog copy, wall text, and brochure translation for decades. A platform that suggests otherwise is selling something a museum shouldn't buy. A platform that hides how a draft was generated, or what was machine-translated versus human-written, is selling something even worse.

The clean shape is for the platform to mark every generated track as machine-drafted, surface the diff when a reviewer edits it, and version the published track so the institution can demonstrate exactly what shipped. Museums are accountable institutions; the tooling under them should make that accountability easier, not harder.
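As a rough illustration of that shape, the sketch below keeps every version of a track, labels each one as machine-drafted or human-edited, and produces a diff between any two versions. It is a minimal sketch under stated assumptions: `Track`, `TrackVersion`, and the `origin` labels are hypothetical names rather than any vendor's data model, and the Spanish sentences are made-up sample text. The diff itself uses `difflib.unified_diff` from the Python standard library.

```python
import difflib
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class TrackVersion:
    number: int
    text: str
    origin: str             # "machine-draft" or "human-edit" (hypothetical labels)
    editor: Optional[str]   # None for machine drafts

@dataclass
class Track:
    language: str
    versions: List[TrackVersion] = field(default_factory=list)

    def add(self, text: str, origin: str, editor: Optional[str] = None) -> TrackVersion:
        """Append a new immutable version; earlier versions are never overwritten."""
        version = TrackVersion(len(self.versions) + 1, text, origin, editor)
        self.versions.append(version)
        return version

    def diff(self, older: int, newer: int) -> str:
        """Unified diff between two version numbers (1-based)."""
        a = self.versions[older - 1].text.splitlines()
        b = self.versions[newer - 1].text.splitlines()
        return "\n".join(
            difflib.unified_diff(a, b, f"v{older}", f"v{newer}", lineterm="")
        )

track = Track("es")
track.add("El retrato fue pintado en 1889.", origin="machine-draft")
track.add("El retrato se pintó en 1889.", origin="human-edit", editor="reviewer-es")

change_log = track.diff(1, 2)
print(change_log)
```

Because versions are append-only and each carries its origin and editor, the institution can show which text was live at any point and who changed the machine draft.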

Is multilingual access a legal requirement for museums?

For most US museums the answer is "partly, and rising." Title VI of the Civil Rights Act of 1964 prohibits national-origin discrimination, and a 2000 executive order (Executive Order 13166) extended that to language access: recipients of federal financial assistance are expected to provide meaningful access to people with limited English proficiency (LEP). The Institute of Museum and Library Services, which is the federal funding pipeline for most US museums, maintains its own Language Access Plan and expects grantees to reduce LEP-based barriers to access.

The frame that matters: if the museum takes federal money (IMLS grants, NEH grants, federal capital funds), language access is not a marketing question. It's a Title VI question. Most institutions are nowhere near the edge of the law here, but the law is the reason the floor exists. The audio guide is one of the most visible places where that floor is either met or missed.

A note for non-US readers: equivalent obligations exist under most European national-language regimes, and in Canada under the Official Languages Act. Check your funder agreements. The category is moving fast and policies vary.

How fast can a multilingual tour ship on an AI platform?

The technical answer for a curator-approved English script: minutes for the audio render in any single language. The operational answer for a tour ready to publish in production: as fast as the slowest reviewer.

A useful benchmark from real launches: a small pilot with one gallery in two or three languages can ship in a week of working time. A full permanent-collection launch in eight to ten languages takes one to three months of curator and reviewer time — most of it spent in the English review and the per-language localization passes, not in the production tooling. The studio-production benchmark for the same scope is six months to a year — the figure that's been described inside the industry as the practical floor for a long time.

The right framing for a director: AI multilingual production removes the production bottleneck. It does not remove the editorial bottleneck. Plan staffing around the second.

What about visitor Q&A in non-English languages?

The frontier question. Most serious AI audio platforms now support a conversational layer where visitors can ask follow-up questions at each stop, grounded in the same curator-approved source materials. The quality of that conversation in non-English languages varies widely, and the variation tracks the underlying language model's training distribution.

For Spanish, French, German, Italian, Portuguese, Japanese, and Korean, modern multilingual models are strong enough that grounded Q&A is usable. For Chinese (especially traditional characters and Cantonese) and Arabic (especially regional dialects), quality varies by stop and topic. For lower-resource languages, the broadcast-only multilingual track usually outperforms the live Q&A — at least for now.

If the visitor Q&A layer is part of what you're buying, ask vendors to run a live demo in each of your target languages, with a few questions you've prepared in advance, before signing. Treat the conversational layer as a per-language capability, not a global one.

What's the disclosure standard for multilingual audio?

The field is still settling here. Most US institutions I've talked to are comfortable with a one-line credit on the tour page that names the curatorial team as authors, names the platform as the production tool, and notes which languages were reviewed by native speakers. A short list of credited per-language editors — like the translation credits on the colophon of a published book — reads as confidence, not apology.

The one option that is never acceptable is silence on the question. Visitors who notice translation quality are usually paying close attention; an honest disclosure earns goodwill in exactly that audience.

For the broader argument on disclosure and trust, see authenticity and AI in museum interpretation.

How does this fit into the broader AI audio guide picture?

Multilingual interpretation is, for most institutions evaluating this category, the strongest argument for switching. The production math on traditional studio multilingual is what kept most museums English-only for decades. AI changes that math without changing the curatorial standard, the editorial responsibility, or the institutional voice — which means it changes who the museum can credibly serve, in their first language, on opening day.

If you want the pillar-level treatment of AI audio guides as a category — production model, hallucination, visitor experience, vendor evaluation — start with the AI audio guides hub and the spoke on AI audio guides versus traditional. For the visitor-side argument on what people in galleries now expect, read the note on the 2026 museum visitor. For specifics on what Convo charges, see pricing.

Frequently asked questions

English and Spanish, in most cases. Roughly 22% of the US population age five and older speaks a language other than English at home per the latest ACS data, and Spanish accounts for about 61% of that share. A second language costs essentially nothing to add on AI platforms — the cost is the reviewer, not the production.

For mainstream language pairs — English/Spanish, English/French, English/German, English/Italian, English/Portuguese, English/Japanese, English/Korean — modern systems produce drafts that are publishable after editorial review. For lower-resource languages, indigenous languages, or culturally specific narration, the AI draft is a starting point at best. Plan the review workflow accordingly.

Translation moves the words across. Localization moves the meaning across — the idioms, the register, the cultural references, the rhythm. A literal translation reads as foreign. A localized track reads as if written for that audience from the start. AI does the first; humans do the second.

Common sources: heritage-language faculty at local universities, professional translation agencies with museum portfolios, community organizations and language schools in your catchment area, and — for major destination languages — sister institutions abroad. Many museums also build a roster of paid per-language editors who review across exhibitions over time.

Sign language and audio description for blind and low-vision visitors are separate, equally important questions, treated as accessibility rather than multilingual. They generally require dedicated production — ASL on video, audio description by trained describers — not AI translation. See the Accessibility pillar when its hub is published for a full treatment.

For mainstream target languages, modern neural TTS is hard to distinguish from a studio recording in blind tests. For lower-resource languages the quality gap is more noticeable. Ask any vendor to play their voice samples in your target languages, on your phone, in the gallery — not just in English on a demo laptop.

The visitor asks a question in their language at a stop. The platform retrieves the relevant section of the curator-approved source materials, answers in the visitor's language, and (on serious platforms) refuses to answer when the source set doesn't cover the question. Quality tracks the underlying multilingual model — strong in major European and East Asian languages, weaker in lower-resource ones.

If you receive federal funding, Title VI and the 2000 executive order on language access apply: you're expected to provide meaningful access to limited-English-proficient visitors. IMLS publishes its own Language Access Plan and expects grantees to operate inside it. Most museums are well clear of the edge of the law, but the legal floor is why an audio guide that ships English-only is increasingly open to question.

Continue reading

For the foundational picture of what an AI audio guide is and how the category works, the AI audio guides hub is the place to start. For the production-math comparison against traditional studio-produced audio, see AI audio guides versus traditional audio guides.

For the visitor-side essays that ground this work: the 2026 museum visitor on what people walk in expecting; one collection, many audiences on serving very different visitors from a single curatorial corpus; and the audio guide is not the product on what museums are actually buying when they buy this category.

For pricing specifics, Convo pricing is the source of truth. For the product itself, see Convo.


About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about how museums could afford to be more ambitious with interpretation, drawing on discovery conversations with curators, directors, and education leads at small and mid-size US museums. Reach him at eric@convo.app or on LinkedIn.
