PILLAR 01 · AI AUDIO GUIDES

AI audio guide vs traditional audio guide: what actually changes.

An honest, dimension-by-dimension comparison of AI-narrated audio guides against the legacy studio-and-handset model — cost, time, languages, control, accessibility, and the cases where the legacy model still wins.

ERIC DUFFY·FOUNDER·12 MIN READ·UPDATED 2026-05-29

The short version: an AI audio guide and a traditional audio guide produce the same artifact for the visitor (a piece of narrated interpretation tied to a stop on a tour), but almost everything upstream of that artifact is different. That includes cost, time-to-launch, the number of languages you can afford, how often you can update content, and who carries the equipment. This piece walks through what actually changes, dimension by dimension, with the numbers I've seen on the inside of both models. It also has an honest section on where the traditional studio-and-handset model still wins, because there are real cases.

I'm the founder of Convo, so I have a horse in this race. The dimensions below are the ones our prospective customers ask about on every first call, and I've tried to give them the answer I'd want before signing a five-figure annual contract.

At a glance

| Dimension | Traditional (studio + handset) | AI-narrated (phone-based) |
|---|---|---|
| Production cost | $30,000–$150,000+ per tour, pre-translation (Convo, 2026) | $15,000–$50,000/year platform fee, unlimited tours |
| Time to launch | 6–18 months (drafting, casting, studio, edit, QC) | 2–6 weeks (upload sources, edit drafts, publish) |
| Voice quality | A specific human in a treated studio | Generated voice; 2025-era models near-indistinguishable in short narration |
| Languages | $5k–$15k per additional language, weeks each | Ten languages from one English source, ~60 seconds to re-voice |
| Update cadence | Re-book studio, re-record, re-press; rarely done | Edit script in admin; new audio in seconds |
| Accessibility | Hardware accommodations vary; transcript often absent | Phone-native screen reader, text, font sizing, captions for free |
| Hardware / logistics | Rented handsets, sanitization, charging, theft, repair | QR code; visitor's own phone; no inventory |

The rest of this piece unpacks each row and ends with the cases where the left column still wins.

How do the production costs compare?

The AI model wins clearly on year-one cost: in the worked example below, roughly $42,000 against $100,000 once handsets are included. Over several years the answer depends on how often you refresh content and add languages. Traditional studio production for a full museum audio tour (scripting, voice casting, recording, editing, sound design, mastering, multilingual versioning) typically lands between $30,000 and $150,000 per tour before hardware, depending on length, language count, and whether the museum hires its own talent or uses a turnkey agency. We publish that range on our about page; it matches what curators tell us they were quoted in their last RFPs from the legacy studio-and-handset vendors.

That number breaks down roughly as follows: scriptwriting at $150–$300 per finished minute, voice talent at the $200–$275 per-finished-hour audiobook floor (Voice Over Resource Guide, 2024), studio time at $150–$250 per hour with at least a 2:1 studio-to-finished ratio, plus editing, sound design, and mastering. Then multiply most of that by every additional language.

The AI-narrated model collapses those line items into a software subscription. Convo's published pricing — Studio at $1,200/month and Institution at $3,500/month — covers unlimited tours, unlimited languages, and unlimited edits. Other modern platforms in the category sit in the same shape: a platform fee rather than a per-tour production budget.

A five-year total cost of ownership comparison for a mid-size museum producing one 30-stop English tour plus three additional languages looks roughly like this:

| Year | Traditional | AI-narrated (Institution tier) |
|---|---|---|
| Year 1 | $80,000 production + $20,000 handsets | $42,000 |
| Year 3 | + $25,000 (re-record one language, content update) | $42,000 |
| Year 5 | + $40,000 (refresh + handset replacement) | $42,000 |
| 5-year total | ~$165,000 | ~$210,000 (five annual fees of $42,000) |

Note the catch in the bottom row: at the five-year mark, the traditional model is cheaper in raw dollars, provided you refresh content no more than the table assumes and never add languages. The moment either of those assumptions breaks (and they almost always do), the AI model pulls ahead and stays ahead. Add a fourth language, refresh one wing, fix a single attribution error, and the gap closes inside a year.
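The table's arithmetic can be checked year by year. The sketch below is a minimal illustration, not a pricing tool. It takes the spend figures from the table above, assumes no traditional spend in years 2 and 4 (the table lists none), and assumes the Institution tier stays at $3,500/month for all five years. The function and variable names are made up for this example.

```python
# Traditional spend by year, taken from the table above.
# Assumption: years not listed in the table (2 and 4) have no spend.
TRADITIONAL_SPEND = {
    1: 80_000 + 20_000,  # production + handsets
    3: 25_000,           # re-record one language, content update
    5: 40_000,           # refresh + handset replacement
}

# Institution tier at $3,500/month, assumed flat for the whole period.
AI_ANNUAL_FEE = 3_500 * 12


def cumulative_costs(years, traditional_spend, ai_annual_fee):
    """Return a list of (year, traditional_total, ai_total) running totals."""
    totals = []
    traditional_total = 0
    ai_total = 0
    for year in range(1, years + 1):
        traditional_total += traditional_spend.get(year, 0)
        ai_total += ai_annual_fee
        totals.append((year, traditional_total, ai_total))
    return totals


totals = cumulative_costs(5, TRADITIONAL_SPEND, AI_ANNUAL_FEE)
for year, traditional_total, ai_total in totals:
    print(f"Year {year}: traditional ${traditional_total:,} | AI ${ai_total:,}")

# Difference at year five (positive means the AI model has cost more so far).
gap = totals[-1][2] - totals[-1][1]
print(f"Five-year gap: ${gap:,}")
```

On these assumptions the two models are roughly level at year three ($125,000 against $126,000), and the traditional model ends year five about $45,000 ahead. That $45,000 is the margin any unplanned refresh or added language eats into.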

How long does each take to launch?

The traditional model is measured in quarters; the AI model is measured in weeks. A typical traditional tour runs 6–18 months from kickoff to opening day. The bottleneck is rarely the writing — it's the sequential, schedule-dependent work: voice casting (4–8 weeks), studio booking (2–6 weeks lead time), recording and pickups (2–4 weeks), edit and master (2–4 weeks), then the same cycle repeated per language.

The AI-narrated model compresses this to 2–6 weeks. The pieces that took months become minutes: a draft in roughly 90 seconds from uploaded reference materials, re-voicing across ten languages in roughly a minute. The remaining weeks are the work that doesn't compress — curator review, factual checking, internal sign-off. We've watched institutions go from a signed contract to a live, multilingual tour in under three weeks when their reference materials were already organized.

The point isn't that AI is faster. The point is that "faster" changes what's possible. A traveling exhibition with a ten-week run cannot justify a nine-month production cycle. A rotating gallery that re-installs three times a year cannot wait six months per refresh. The legacy timeline silently determined which collections got audio at all.

Is the voice quality actually comparable?

A 2025-era generated voice in short-form narration is, for most listeners, indistinguishable from a competent studio read — but not from a great one. This is the dimension where I have to be honest about the trade-off, because curators are right to ask.

Modern neural TTS (the engines behind platforms like ElevenLabs, Cartesia, Inworld, and the models in our own stack) handles 30-to-90-second pieces of museum narration well: appropriate pacing, breath, and emphasis on the right word. In informal double-blind testing we've done with curators on neutral text, the rate at which listeners correctly identify the human read sits near 50%, i.e., chance.

Where generated voice still loses: emotional range across a long arc, idiosyncratic readings that a specific actor brings (Werner Herzog narrating a piece on volcanism is not interchangeable with a TTS read of the same script), and any case where the voice itself is the curatorial choice. A studio read by a specific person you cast is not the same product as a generated read. It is, however, in most cases, the same artifact from the visitor's point of view at the stop.

Verdict: comparable for default narration; the traditional model wins outright when the named voice is the point.

How many languages can you actually afford?

This is the dimension that has changed the most, and it changes the math of every other dimension. The traditional model adds languages linearly: each additional language is roughly 60–80% of the original English production cost, because the script has to be translated, a native voice has to be cast, a studio booked, and the audio edited and mastered. The realistic ceiling for a mid-size museum is two or three languages.

The AI model adds languages from one source. Convo ships ten — English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic — and re-voices an entire tour across all of them in roughly 60 seconds. The marginal cost of the tenth language is effectively zero.
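To make "linearly" concrete, here is a small sketch of how traditional production cost grows as languages are added. It uses the 60–80% per-language range quoted above. The $50,000 English production cost is a hypothetical figure chosen only for illustration, and the function name is an assumption of this example.

```python
def traditional_cost(english_cost, extra_languages, pct_per_language):
    """Total production cost when each extra language costs a fixed
    percentage of the English production.

    Integer math is used so the results carry no floating-point noise.
    """
    return english_cost * (100 + extra_languages * pct_per_language) // 100


ENGLISH_COST = 50_000  # hypothetical English-only production cost

for extra in range(0, 4):
    low = traditional_cost(ENGLISH_COST, extra, 60)   # 60% per extra language
    high = traditional_cost(ENGLISH_COST, extra, 80)  # 80% per extra language
    print(f"{extra} extra languages: ${low:,} to ${high:,}")
```

With three extra languages the hypothetical tour lands between $140,000 and $170,000, roughly three times the English-only cost. In the AI model the subscription fee is the same at any language count, so the equivalent curve is flat.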

Why this matters: roughly 20% of the US population speaks a language other than English at home (American Alliance of Museums, 2022), and that share is substantially higher in the cities where most museums sit. Spanish, Mandarin, French, and Korean are the four languages we see most-requested by US institutions. The legacy model could only ever justify the first one, and only for institutions large enough to absorb a six-figure production line. The AI model makes "speak to every visitor in their first language" a default rather than a stretch goal.

Who controls the content, and how often can it change?

Editorial control is comparable between the two models, but update cadence is not. In both, a curator is the author of record. In the traditional model, a writer drafts, the curator edits, the actor reads what the curator approved, and the studio cuts to that approved take. In the AI model, the curator uploads reference materials, the platform drafts from those sources, the curator edits the script line by line, and the audio is generated from the approved script. The chain of approval is the same; the bottleneck moves.

What's genuinely different is what happens after launch. A traditional tour, once pressed, is effectively frozen — re-recording a single stop means re-booking the same talent (often impossible six months later), re-engineering the audio to match the original mix, and re-deploying to handsets. Most museums I've talked to have at least one tour stop they know is wrong but can't justify fixing. The AI model lets a curator change the script in the admin and re-voice in seconds. A correction is not a project; it's a Tuesday.

What about accessibility and hardware?

The phone-based AI model wins on accessibility almost by accident, and wins on hardware logistics outright. A traditional handset is its own accessibility problem: the device's screen reader, font sizing, color contrast, and audio output are whatever the vendor shipped, and transcripts are often a separate workstream that doesn't get built. Sanitization, charging, theft, repair, and the ongoing fleet upkeep — typically a low five-figure annual line on a mid-size deployment, dominated by cleaning labor — are real operational costs that a phone-based system erases.

A phone-based tour inherits the visitor's own device, which in turn inherits the operating system's accessibility stack — VoiceOver, TalkBack, dynamic type, captioning. Smartphone ownership in typical museum-visiting demographics is above 85% across most Western countries and above 90% in urban, higher-income, museum-going segments (MuseumNext, 2024). The remaining gap is real and worth planning for — most institutions handle it with a small fleet of loaner phones at the front desk — but the comparison has flipped: BYOD is now the default, and rented handsets are the accommodation.

Where the traditional model still wins

Honesty section. There are real cases where the legacy studio-and-handset model is the right answer, and we won't pretend otherwise. Four of them:

1. When a specific named voice is the headline. If a major donor has agreed to narrate a wing, if the curator herself is the public face of the institution, or if you've cast a celebrity ambassador, the production is the product. A generated voice cannot do what a named voice does — the voice is not the delivery mechanism, it's part of the curatorial offer. An artist-narrated tour at the Whitney or a curator-led series at the Getty is not interchangeable with a TTS read, no matter how good the TTS gets.

2. When the collection truly does not change for a decade. A permanent installation at a national monument, a fixed historical site whose interpretation is settled, a memorial whose script has been carefully negotiated with multiple stakeholders — these are environments where the update agility of an AI platform is a feature you'll never use. If you're going to produce a tour once and not touch it for ten years, the higher one-time production cost amortizes fine.

3. When you're mid-contract on a handset fleet. If you signed a five-year hardware-and-content contract with a legacy studio-and-handset vendor eighteen months ago, the realistic move is to ride out the contract, capture analytics on what's working, and plan the migration for renewal. Switching mid-contract usually doesn't pencil out.

4. When the production process itself is part of the institutional value. A few institutions — usually those with in-house audio teams, named producers, and a track record of award-winning interpretation — have a production process that is part of why their tours matter. The Met's audio guide is not just a tour, it's a Met artifact. For those institutions, the AI model is at best a complement to specific exhibits, not a replacement for the program.

If none of those four describe you, the AI-narrated model is probably the right choice, and the harder question is which platform.

What's the right next step?

If you've read this far, you're past the category-level question and into vendor evaluation. The most useful next move is usually to run a small paid or free pilot — one tour, real visitors, real analytics — against your current production process, and compare actual results rather than spec sheets. Convo offers a no-time-limit free pilot tier for exactly that reason; most platforms in the category offer something similar.

See the full pillar guide.

FAQ

For the default listen-and-walk experience, mostly no. For the follow-up experience — asking a question at a stop, going deeper on a detail, listening in a different language — yes, completely. The traditional model gives every visitor the same recording; the AI model lets every visitor have a different conversation off the same source material.

Usually yes. Most platforms accept existing scripts as reference material and will match the register and structure. You can bring your studio-recorded English audio in as the canonical track and use the AI platform to extend into the languages and updates you couldn't justify producing traditionally.

For a single small institution, $15,000–$20,000 a year covers a serious program: a phone-based platform subscription, light staff time to upload reference materials and review drafts, and signage for QR codes. There's typically no production line item separate from the subscription, which is the structural difference from the traditional model.

No. Docents do something neither model can do — read a room, take questions with full subject expertise in the moment, and model how to look. A 2026 docent program and an AI audio guide are complements that serve different units of value: the docent for the small share of visits they can cover with their full presence, the guide for the rest, which would otherwise get only wall text.

A defensible AI audio guide grounds visitor answers in the curator's uploaded reference materials, and declines to answer when it can't ground a response rather than inventing one. The narration itself isn't a hallucination risk — it's read from a curator-approved script. The conversational layer is where grounding matters; that's the dimension to ask vendors about hardest.

Increasingly yes, especially when framed around access and multilingual reach. The argument that lands is mission-aligned: "We can now serve Spanish, Mandarin, French, and Korean visitors in their first language across the entire permanent collection, for less than we previously spent on English-only audio for one wing." That's a board slide, not a defensiveness exercise.

Some already do, in pilot form. The shape of the question over the next 24 months isn't AI vs. traditional, but which platforms got the grounding, multilingual, and editorial-control parts right. The category will consolidate; the principle-of-evaluation work in our pillar guide is meant to outlast any particular vendor's positioning.

The verdict

For roughly nine in ten of the museums, cultural sites, and tour operators we talk to, the AI-narrated model is now the right default — not because the technology is impressive, but because the math of the traditional model never worked for most of a collection in most languages with anything close to current scholarship. The legacy model still wins where the voice is the point, where the collection truly doesn't change, where a hardware contract isn't yet expired, or where the production process itself is part of the institution's identity. Everywhere else, the comparison is no longer between two ways of making a tour — it's between making tours your institution couldn't previously justify and continuing to leave most of the building silent.

If you want the full category map before committing to a vendor, read the pillar guide on AI audio guides. If you're ready to look at numbers for your own institution, our pricing is published in full and the pilot tier is free.


About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the economics of museum interpretation from inside the category — drawing on RFP data, discovery calls with curators and directors, and the production economics of both the studio-and-handset model and the AI-narrated model. Reach him at eric@convo.app or on LinkedIn.
