PILLAR 03 · MULTILINGUAL INTERPRETATION

Re-voicing a museum audio tour across ten languages without a studio.

The operational mechanics of re-voicing an approved English tour across ten languages on an AI-narrated platform — the workflow, where human review is still non-negotiable, and how it compares to the studio-and-cast model it replaces.

ERIC DUFFY · FOUNDER · 9 MIN READ · UPDATED 2026-05-29

Re-voicing a tour used to be the single most expensive operational decision a museum made about its audio program. The studio booking, the casting search for a native speaker in each target language, the script translation, the recording, the edit, the mastering, and the redeployment to handsets — every one of those steps had to happen again, for every language, every time the script changed. So the script didn't change. And if a tour shipped in two languages, it shipped in two languages forever.

That math is what kept most US museums on English-only audio. It's not what the curators wanted; it's what the production model could afford. This piece walks through what re-voicing actually looks like on a modern AI-narrated platform — including Convo — and where the workflow still depends on a human reviewer per language. It also has an honest section on the cases where booking a studio is still the right answer.

I'm the founder of Convo, so this is a perspective from inside one of the platforms in the category. I've tried to write the operational walkthrough I'd want before signing a multilingual commitment.

What does re-voicing a museum audio tour actually mean?

Re-voicing is the production step where an approved script becomes finished audio in one or more languages. In the traditional model, re-voicing is synonymous with a studio session: a voice actor reads the approved script into a microphone, an engineer edits and masters the takes, and the resulting file is the canonical recording. Every language is a separate session, separate talent, separate edit.

On an AI-narrated platform, re-voicing is a deterministic step that runs against an approved script. The platform regenerates audio across every supported language from a single source of truth — usually the English script the curator signed off on. The language layer is mechanical; the editorial layer (translation choices, named-entity pronunciation, cultural framing) is where humans still belong.

The shift isn't really about voice quality. It's about which artifact is canonical. In the studio model, the recording is canonical and the script is a piece of paper that produced it. In the AI model, the script is canonical and the recording is regenerable from it. That inversion is what makes multilingual coverage affordable.
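One way to picture the inversion, as a minimal sketch rather than a description of how Convo or any other platform is built: key each rendered file by a fingerprint of the approved script, so that any script edit marks the audio as stale and due for regeneration. Every name here (`script_fingerprint`, `AudioStore`, `synthesize`) is hypothetical, and `synthesize` returns placeholder bytes instead of calling a real TTS engine.

```python
import hashlib
from dataclasses import dataclass, field


def script_fingerprint(language: str, script: str) -> str:
    """Stable key for one language's approved script text."""
    payload = f"{language}\n{script}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def synthesize(language: str, script: str) -> bytes:
    """Placeholder for a TTS call (assumption: a real engine returns audio bytes)."""
    return f"[{language} audio] {script}".encode("utf-8")


@dataclass
class AudioStore:
    """Audio is derived data: it is looked up by the fingerprint of its script."""
    rendered: dict[str, bytes] = field(default_factory=dict)

    def is_stale(self, language: str, script: str) -> bool:
        # No audio exists yet for this exact script text in this language.
        return script_fingerprint(language, script) not in self.rendered

    def revoice(self, language: str, script: str) -> bytes:
        # Regenerate only when the script has changed since the last render.
        key = script_fingerprint(language, script)
        if key not in self.rendered:
            self.rendered[key] = synthesize(language, script)
        return self.rendered[key]


store = AudioStore()
store.revoice("es", "Bienvenidos a la galería.")
print(store.is_stale("es", "Bienvenidos a la galería."))        # False
print(store.is_stale("es", "Bienvenidos a la galería norte."))  # True
```

In the studio model there is no equivalent of `is_stale`: the recording simply drifts away from whatever the script now says.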

How did re-voicing work in the studio-and-handset model?

Studio re-voicing is a serial, calendar-bound process whose costs scale linearly with language count. For a typical 30-stop museum tour, the sequence runs roughly: translate the English script into the target language (a week, sometimes more for review), cast a native voice actor (two to six weeks of auditions and negotiation), book studio time (two to six weeks of lead time), record the session (one to three days plus pickups), edit and master (one to two weeks), redeploy to handsets or the app build (one to two weeks). Then repeat for every additional language.
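To see how the calendar stacks up, the ranges above can simply be added. The sketch below does that in calendar days, treating a week as seven days and the translation step as a flat week. The option to run casting and studio booking concurrently is an assumption of this example, not something the sequence above specifies.

```python
# (min_days, max_days) per step, taken from the ranges quoted above.
STEPS = {
    "translate": (7, 7),       # "a week, sometimes more": only the floor is used
    "cast": (14, 42),          # two to six weeks
    "book_studio": (14, 42),   # two to six weeks of lead time
    "record": (1, 3),          # one to three days, pickups not counted
    "edit_master": (7, 14),    # one to two weeks
    "redeploy": (7, 14),       # one to two weeks
}


def per_language_days(overlap_cast_and_booking: bool = False) -> tuple[int, int]:
    """Sum the step ranges; optionally run casting and booking concurrently."""
    lo = sum(days[0] for days in STEPS.values())
    hi = sum(days[1] for days in STEPS.values())
    if overlap_cast_and_booking:
        # Two concurrent steps cost only the longer one, so drop the shorter.
        lo -= min(STEPS["cast"][0], STEPS["book_studio"][0])
        hi -= min(STEPS["cast"][1], STEPS["book_studio"][1])
    return lo, hi


print(per_language_days())                               # (50, 122)
print(per_language_days(overlap_cast_and_booking=True))  # (36, 80)
```

Read strictly in series, one language takes roughly seven to seventeen weeks. Overlapping casting with studio booking brings the floor down to about five.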

The hard costs add up quickly per language: translation, casting, studio time, talent fees, editing, mastering. Voice talent itself sits in the $180–$250 per-finished-hour band for production-house and publisher-grade audiobook work (Backstage GVAA explainer, 2024), with corporate and museum narration often higher when usage rights are factored in.

But the harder cost is the calendar. Six weeks to add a language means a temporary exhibition with a ten-week run cannot affordably ship in any language but the one it opened in. The legacy timeline silently determined which collections got multilingual audio at all — almost none.

What does the AI-narrated re-voicing workflow look like?

The mechanical sequence is: edit the English script, translate it across the language set, regenerate audio per language, review, publish. In practice on Convo this runs roughly as follows.

A curator edits or finalizes the English script in the admin portal. They mark the tour as approved in English. The platform produces a translation per target language using a model tuned for the museum register — formal where formal is right, conversational where the source script chose conversational. The platform generates audio for every language using neural TTS. The output across all ten languages is ready for review in roughly sixty seconds of wall-clock time. The curator listens to the named-entity passages and the culturally loaded sections in each language, makes any wording fixes, and publishes.

The wall-clock budget breaks roughly into: machine work, about a minute; reviewer work, between thirty minutes and several hours per language depending on tour length, reviewer fluency, and how culturally sensitive the material is. For a thirty-stop secular collection translated into Spanish and French by an in-house bilingual staffer, this can finish inside a single afternoon. For an exhibition involving religious objects, ritual contexts, or contested historical naming, the review can — and should — take days.

Where in the workflow does human review still matter?

Three places, predictably: named entities, religious and ritual terminology, and region-specific cultural framing. Most of an interpretation script is descriptive prose that a modern translation model handles cleanly. The places where it gets things subtly or unsubtly wrong cluster tightly around the same three failure modes.

Named entities. Artist names, donor names, place names, and titles of works are the single biggest source of error, because they're the cases where the model has to choose between transliterating phonetically and using an established conventional rendering. Academic work on machine transliteration between English, Arabic, Chinese, and Japanese has documented this for two decades (ACL: Linguistic Issues in Machine Transliteration of Chinese, Japanese and Arabic Names, 2016); the failure modes have changed in degree but not in kind. "Hokusai" has a settled Japanese rendering; a transliteration model can still produce a phonetic approximation that no Japanese reader would recognize. A curator with the language has to catch these.

Religious and ritual terminology. Translating across religious traditions requires more than lexical accuracy. A model can render "the Virgin Mary" into Spanish, French, and Italian without trouble; it has a harder time choosing between competing renderings of "the Buddha" in Korean Buddhist contexts versus secular art-historical ones, or handling terms in Islamic art whose English usage drifts from how Arabic-speaking visitors would expect them framed.

Region-specific cultural framing. A passage that reads fine in English about a colonial-era acquisition might land very differently when translated into a language whose speakers were on the other side of that history. The translation can be technically perfect and the framing still wrong for the audience. This is the failure mode that most needs a native-speaker reviewer with cultural context, not just bilingual fluency.

The rest of the script — descriptions of materials, dates, dimensions, comparative art-historical context — does not need per-language curator review in the same way. It needs spot-checking, not auditing. The point of the workflow is to concentrate the human attention where it actually changes the visitor's experience.
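The triage described above can be approximated with a curator-maintained watchlist. This is a deliberately naive sketch: the categories mirror the three failure modes, but the terms, the function name `review_level`, and plain substring matching are all illustrative assumptions. A watchlist will never catch a framing problem it was not told about.

```python
# Hypothetical watchlist; a real one would be maintained per collection.
WATCHLIST = {
    "named_entity": ["Hokusai", "Monet"],
    "religious_term": ["Buddha", "Virgin Mary"],
    "cultural_framing": ["colonial", "acquired in", "expedition"],
}


def review_level(
    stop_text: str, watchlist: dict[str, list[str]] = WATCHLIST
) -> tuple[str, list[str]]:
    """Return ("audit", reasons) if any watchlist term appears, else ("spot-check", [])."""
    lowered = stop_text.lower()
    reasons = [
        category
        for category, terms in watchlist.items()
        if any(term.lower() in lowered for term in terms)
    ]
    return ("audit" if reasons else "spot-check", reasons)


print(review_level("Woodblock print by Hokusai, acquired in 1921."))
# ('audit', ['named_entity', 'cultural_framing'])
print(review_level("Oil on canvas, 65 by 81 centimetres."))
# ('spot-check', [])
```

The output is a routing hint for the reviewer, not a verdict: stops marked "audit" get a full listen in every language, the rest get sampled.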

How does this compare to the traditional studio approach?

A side-by-side on the same thirty-stop tour, translated and re-voiced across nine additional languages:

| Step | Traditional (studio + cast) | AI-narrated (Convo) |
|---|---|---|
| Translate script | 1 week per language, sequential or parallel | Seconds per language |
| Cast voice talent | 2–6 weeks per language | Not applicable (TTS) |
| Book studio | 2–6 weeks lead time per language | Not applicable |
| Record | 1–3 days plus pickups | ~60 seconds across all nine |
| Edit + master | 1–2 weeks per language | Not applicable |
| Curator / language review | Pre-session only; locked after print | 30 min – several hours per language, in-platform |
| Per-language hard cost | Translator + voice talent + studio + edit, per language | Included in platform fee |
| Total wall-clock to ship 9 languages | 4–9 months | Hours to days |

The point of the table isn't that AI is faster. It's that the cost of being wrong is different. In the studio model, a fix after launch means re-booking the same talent (often impossible), re-cutting the audio to match the original mix, and redeploying. So fixes don't happen. In the AI model, the curator finds the mispronunciation in the Korean version a week after launch, types in the correct rendering, and re-voices that language in seconds. Same-day correction is genuinely a different operating posture — the connected piece on same-day museum tour updates walks through what it changes about the day-to-day.

What does "re-voicing across ten languages in sixty seconds" actually mean?

It means the platform-side audio generation runs in roughly sixty seconds of wall-clock time across all ten supported languages from one approved English source. Convo's supported language set is English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, and Arabic. From the curator's seat, the experience is: click "re-voice all languages," wait while a progress indicator runs, listen to the result.

What that single number is doing in practice: invoking a translation step per non-English language, invoking a neural TTS engine per language, and reassembling per-stop audio files into the visitor-facing player. The actual computation is parallelized across languages, which is why ten doesn't take ten times as long as one.
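Why parallelism flattens the wait can be shown with a toy fan-out. This sketch uses Python's `asyncio`, with a sleep standing in for the translation and TTS calls. The function names and the 0.2-second delay are invented for illustration and say nothing about Convo's actual architecture.

```python
import asyncio
import time

LANGUAGES = ["en", "es", "fr", "de", "it", "pt", "zh", "ja", "ko", "ar"]


async def revoice_language(language: str, script: str) -> str:
    """Stand-in for translate + synthesize; the sleep simulates remote work."""
    await asyncio.sleep(0.2)
    return f"{language}:{len(script)} chars voiced"


async def revoice_all(script: str) -> list[str]:
    # gather() runs the per-language jobs concurrently and keeps input order.
    return await asyncio.gather(
        *(revoice_language(language, script) for language in LANGUAGES)
    )


start = time.perf_counter()
results = asyncio.run(revoice_all("Welcome to the gallery."))
elapsed = time.perf_counter() - start

# Ten jobs finish in roughly the time of one (about 0.2s), not ten times that.
print(len(results), f"{elapsed:.1f}s")
```

The total is bounded by the slowest language rather than the sum of all of them, which is the property the sixty-second figure depends on.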

What that single number is not doing: bypassing the curator. The output of the sixty-second run is a draft. Publishing requires the curator to approve. We've been explicit about this since our first deployment because it's the part platforms in the category most often gloss over — the speed claim is real, the bypass-the-human implication isn't. The companion piece on translation versus localization for museums goes deeper on where the reviewer's judgment is doing the heaviest lifting.
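That draft-then-approve rule amounts to a gate in front of the publish action. A minimal sketch, with a hypothetical `TourDraft` class and method names that are not taken from any real product:

```python
from dataclasses import dataclass, field


class NotApprovedError(Exception):
    """Raised when someone tries to publish a language nobody has approved."""


@dataclass
class TourDraft:
    languages: list[str]
    approved_by: dict[str, str] = field(default_factory=dict)
    published: set[str] = field(default_factory=set)

    def approve(self, language: str, reviewer: str) -> None:
        # A named reviewer is required so the sign-off is attributable.
        if language not in self.languages:
            raise ValueError(f"{language} is not part of this tour")
        if not reviewer.strip():
            raise ValueError("approval needs a named reviewer")
        self.approved_by[language] = reviewer

    def publish(self, language: str) -> None:
        # Generated audio stays a draft until a reviewer has approved it.
        if language not in self.approved_by:
            raise NotApprovedError(f"{language} has not been approved")
        self.published.add(language)


draft = TourDraft(languages=["es", "ko"])
draft.approve("es", reviewer="in-house bilingual staffer")
draft.publish("es")

try:
    draft.publish("ko")
except NotApprovedError as error:
    print(error)  # ko has not been approved
```

Keeping approval per language matters here: a quick sign-off on Spanish should never unlock Korean by accident.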

Where the studio model still wins

Three cases where booking a studio and casting native voice talent is still the right answer:

1. When the named voice is the curatorial offer. If a specific actor, a curator, an artist, or a community elder is narrating because their voice is part of the meaning — an Indigenous community member voicing a tour about their own collection, a named artist narrating their own retrospective — there is no AI substitute. The voice is not the delivery; it's part of the content.

2. When a language is outside the platform's supported set and the audience matters. Convo ships ten languages. If you need Hindi, Swahili, Tagalog, or a regional dialect spoken by a small community, a one-off studio session with a community-recruited native speaker may be the right move, possibly alongside the platform-supported languages for the rest of the tour.

3. When the audio itself is the artwork. Sound-art tours, oral-history projects, or commissioned interpretive pieces by named producers fall outside this analysis entirely. They aren't audio guides in the operational sense; they're commissioned audio works that happen to function as one.

For everyone else — the museum trying to serve Spanish, Mandarin, French, Korean, and Arabic speakers across a permanent collection without a six-figure production line — the script-canonical, regenerable-audio model is what makes the math work.

What's the right next step?

If you're sizing up a multilingual program for the first time, the most useful exercise is to look at the actual language demographics of your visitors (or your catchment area) and decide which three to five matter most. Most museums find that two to four languages cover the overwhelming majority of non-English-first-language visitors. From there, the platform decision is whether you can ship those languages and update them on a curator's schedule rather than a studio's.

See the multilingual interpretation pillar guide. For the broader category overview, see the AI audio guides pillar. To look at how this works inside the product, see Convo's product page.

FAQ

How long does it take to re-voice a tour across ten languages?

The machine-side audio generation runs in roughly sixty seconds across all ten languages on Convo. The total elapsed time including human review depends on tour length and how culturally sensitive the material is: a secular collection reviewed by an in-house bilingual staffer can finish in an afternoon; an exhibition involving religious or contested historical content can take several days, and should.

Do we need a native-speaker reviewer for every language?

Ideally yes, at least for the named-entity and culturally loaded passages. For languages where you don't have in-house fluency, most museums use a per-language freelance reviewer for the first publish and spot-check thereafter. The cost is modest compared to a studio session and bounds the risk of a mispronunciation reaching a visitor.

What happens when an error is found after launch?

On a platform like Convo, the curator edits the script, re-voices that language only, and the corrected audio is live for visitors in seconds. This is the operating difference from the studio model, where a post-launch correction usually doesn't happen at all.

How well does AI narration handle names and titles?

Reasonably well in supported languages, less so without help. Modern neural TTS handles common name patterns cleanly. For artist names, place names, and titles from outside the language's typical lexicon (a Korean rendering of a French Impressionist's name, an Arabic rendering of an Edo-period Japanese artist), the most reliable fix is for the curator to adjust the script's spelling or context so the model lands on the right pronunciation, then re-voice.

What is the difference between re-translating and re-voicing?

Re-translating produces new text in another language; re-voicing produces new audio from that text. Both happen as part of the same flow on AI-narrated platforms. The reason to separate the vocabulary: the editorial decisions live in the translation step, which is where the language reviewer's judgment matters; the voicing step is mostly mechanical from there.

Can we keep our existing English recording?

On most platforms in the category, yes. Convo accepts an existing English audio file as the canonical English track and uses the approved script as the source of truth for the other nine languages. This is a common path for museums that produced a named-voice English tour previously and want to add multilingual coverage without redoing the original.

The verdict

Re-voicing used to be the operational reason multilingual museum interpretation was a luxury. The studio session, the cast search, the calendar, and the per-language fees added up to a structural ceiling on how many languages a museum could afford to speak. The AI-narrated workflow doesn't eliminate the editorial work, and it shouldn't try to. What it eliminates is the production overhead that made the editorial work uneconomical to do at all for the seventh or eighth language. That's the whole shift. Once the script is canonical and audio is regenerable, multilingual coverage stops being a budget question and becomes an editorial one, and editorial questions are ones most museums are well equipped to answer.

If you want to see how this looks in the product, Convo's product page shows the workflow end-to-end and the pricing is published in full. If you want the surrounding context first, the multilingual interpretation pillar and the AI audio guides pillar are the right next reads.


About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the production economics of museum interpretation from inside the category — drawing on RFP data, discovery calls with curators and directors, and the operational mechanics of both the studio-and-cast model and the AI-narrated model. Reach him at eric@convo.app or on LinkedIn.
