ESSAY

The audio guide is not the product.

The script, the narration, the playlist of stops — these are delivery mechanisms. The actual product is the visitor's relationship to your collection. Most vendors are selling the wrong thing.

ERIC DUFFY · FOUNDER · MAY 29, 2026 · 10 MIN READ

Lead image: a single ceramic vessel on a low plinth in a quiet, empty museum gallery, side-lit by daylight from a tall window.

The audio guide is not the product. The product is the visitor's relationship to your collection. The audio is one of the ways that relationship gets built — a useful one, sometimes a beautiful one — but it is a means, not the thing itself. Most of the audio-tour industry has been selling the means for so long that it has stopped noticing.

This is the part that took me a while to see clearly, and the part I now think matters more than anything else we have argued about on this site.

The view I am arguing against

The conventional view — the one I held for the first few months of building Convo, the one almost every vendor in this category implicitly endorses — is that an audio guide is a finished cultural product. It is written, voiced, mixed, packaged, and shipped, the way an album or a documentary is. You evaluate it the way you evaluate those things: is the script good, is the narrator's voice warm, is the run time right, is the mix clean, are the stops well-chosen.

This is not a stupid view. It comes from a real place. The best audio tours of the last forty years — Acoustiguide's classics, the BBC-style narrative tours that built the form, the great single-voice tours written by curators who happened to also be writers — earned the comparison. They were authored objects. People wrote essays about them. That is the high-water mark, and it is genuinely high.

So when buyers in this category go shopping, they ask the questions you ask about an authored object: who is writing it, who is voicing it, how long does each stop take, how good is the mix. And vendors answer in kind. Sample scripts. Voice reels. Production timelines. The whole sales motion, top to bottom, treats the deliverable as the artifact.

I want to take this view seriously, because it is not wrong about the craft. The craft matters. A bad script in a great voice is worse than no audio at all. But the view is wrong about what the craft is in service of.

So what is the product, actually?

Here is the test that broke it for me.

Think of a museum visit you remember. Not one you read about — one you actually had. Stand in the gallery in your head. What do you remember?

What people consistently report, when they answer that question honestly, isn't the audio. They remember the moments where something clicked. The painting they stayed too long in front of. The artifact whose meaning landed slowly as they walked around it. The detail a docent pointed out that they hadn't seen the first time. The thing they came back to before they left the building. The audio guide, when there was one, is somewhere in the background of those moments — sometimes it primed the click, sometimes it was simply present. But the audio is not what they remember. The relationship they formed with the object is what they remember.

This is, I think, the only thing that matters about a museum visit. Not the only thing that happens — visitors look at things, take photos, eat in the café, buy postcards, get tired — but the only thing that matters in the way museum directors and curators mean when they talk about why their institution exists. The visitor stood in front of a thing, and something happened between them and it. They walked out connected, in some small way, to a piece of human or natural history they had not been connected to before. That is the product. Everything else is logistics.

The audio guide is one of the most powerful tools we have for making that happen, which is exactly why it gets mistaken for the thing itself. But nobody walks out of a museum saying the audio was good. They say the visit was good, and the audio was part of why.

What this changes for curators

Once you accept that the relationship is the product, the curator's job changes shape. Not in substance — curators have always known this — but in where the labor goes.

If the audio is the product, the curator's job is to write it. To draft a script, send it to a voice booth, listen back, revise, approve, ship. That is most of the production calendar. It is the work the legacy vendors are built around servicing.

If the relationship is the product, the curator's job is to figure out what a visitor needs to be told, asked, or shown to have a chance of connecting with this object. The script is one output of that work, but it is not the only one. The suggested prompts a visitor can tap to go deeper are another. The way the guide refuses questions it cannot ground is another. The decision about what is on the wall card and what is held back for the curious is another. The language access is another. The pacing — the choice to write a thirty-second stop instead of a two-minute one because this object rewards short attention — is another.

The curator's product is the visitor's encounter with the collection. The script is a sliver of how they shape it. When you build a platform around producing scripts faster, you make scripts faster, and you change nothing else about the visit. When you build a platform around shaping encounters, you sometimes write less audio, not more.

What this changes for vendor evaluation

The way museums currently evaluate audio-tour vendors is, by and large, the wrong evaluation for the actual product.

The current questions: how much per stop, how many languages, how long is the production cycle, what is the voice talent, what is the editing process. These are real questions and they matter. But they all assume the deliverable is the audio.

The questions I think buyers should be asking instead — the ones that test whether the vendor understands what the product actually is:

What do visitors do with this tour besides listen to it? An audio tour where the only action is pressing play is a broadcast, and the relationship a broadcast builds is bounded. Tours that let visitors ask, follow, or sit with a single object longer than the script wanted them to build a different kind of relationship. Most vendors do not build them, because the deliverable model does not require it.

What does the vendor refuse to do? A vendor that will produce whatever you give them is not your partner; they are your printer. A vendor with opinions about what a visitor should be able to ask, what the guide should decline to answer, what counts as grounded — those opinions are evidence the vendor has thought about the relationship, not just the artifact. You want the opinions, even when you disagree.

What happens after launch? In the deliverable model: nothing, until the next production cycle. Corrections wait. New acquisitions get the same treatment as old ones — none. A vendor that treats launch as the start, not the end, thinks the relationship is the product. A vendor that ships and disappears thinks the audio is.

Can you see what visitors actually asked? Not how many played, how long they listened, what the completion rate was — those are broadcast metrics. The relationship metric is: what did they want to know that you did not already tell them? Nobody could answer that question for museums before, because nobody's tours were interactive.

What museums should measure

If the product is the relationship, the measurements have to change too.

The metrics that get reported up to boards today (number of tour starts, completion rate, dwell time, average session length) measure the audio's success as a piece of media. They are the audio-tour equivalent of TV ratings. They tell you the recording got played. They do not tell you anything about whether the visitor walked out of the gallery connected to what they saw.

The metric that actually matters is harder to capture, but it exists. It is the curiosity the visit produced. The follow-up question the visitor asked the guide. The thing they typed into the chat after the script ended. The object they came back to. The note they sent to the friend they were with. The Google search they did on the train home. Some of this is unmeasurable and should stay that way — the inside of a museum visit is not a funnel and never will be. But some of it is measurable now, in a way it was not before, because the visit is a conversation.

The most useful slide for a museum director isn't the listen count — it's the topic cluster: a grouped, anonymized view of what visitors asked about across the museum. Material and technique. Provenance. Symbolism. That slide is a portrait of a relationship. The institution can see what it stoked, what it left unsatisfied, what it accidentally said the loudest. No legacy vendor's tour can produce that slide, because no legacy vendor's tour invites the question.
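To make the topic cluster concrete, here is a minimal sketch of how such a view could be assembled from question text alone. It is an illustration under stated assumptions, not a description of how Convo does it: the `TOPIC_KEYWORDS` rules, the `classify` helper, and the `topic_clusters` function are all hypothetical, and plain keyword matching stands in for whatever richer grouping a real system would use.

```python
from collections import Counter

# Hypothetical keyword rules, invented for this sketch. A real system would
# likely use something richer than substring matching.
TOPIC_KEYWORDS = {
    "material and technique": ("made of", "glaze", "technique", "material", "fired"),
    "provenance": ("where did", "who owned", "acquired", "found", "provenance"),
    "symbolism": ("mean", "symbol", "represent", "why is"),
}


def classify(question: str) -> str:
    """Return the first topic whose keywords appear in the question, else 'other'."""
    text = question.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return "other"


def topic_clusters(questions: list[str]) -> list[tuple[str, int]]:
    """Count questions per topic, most-asked first.

    The input is question text only; no visitor identifiers are passed in,
    so the output is a grouped count rather than a per-visitor log.
    """
    return Counter(classify(q) for q in questions).most_common()


if __name__ == "__main__":
    sample = [
        "What is this glaze made of?",
        "Who owned it before the museum?",
        "What does the bird represent?",
        "How was it fired?",
    ]
    for topic, count in topic_clusters(sample):
        print(f"{topic}: {count}")
```

The design point is that the unit being counted is the question, not the play event. The "other" bucket is worth reading by hand, since it is where the questions nobody anticipated end up.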

If you measure the right thing, the product reveals itself.

What Convo deliberately does not build

There are things we could ship that would make our product look more like a deliverable, and we are not going to ship them.

We are not going to ship a generic AI docent. The fastest thing to build in this category is a general-purpose chatbot pre-loaded with public art-history knowledge, with the museum's branding on top. It would demo well. It would also produce a guide whose relationship with the visitor is the same relationship the visitor has with any other chatbot — not a relationship with your collection, but a relationship with a model. We refuse to ship that because it would devalue the thing the museum is actually selling.

We are not going to ship one-click translation that runs at playback time. Customers sometimes ask for this. The relationship is between the visitor and the curator's voice, and a translation pipeline that fires at playback means the curator did not approve what the visitor heard. We voice each language ahead of time so a curator can review what a Spanish-speaking visitor will actually be told. The harder path lets the relationship survive translation.
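The difference between the two approaches can be sketched as a review gate. What follows is a minimal illustration, assuming a simple in-memory store; `VoicedStop`, `TourLibrary`, and their methods are hypothetical names invented for this example, not Convo's actual data model. The idea it shows is that a narration in a given language is playable only after a curator has signed off, and that a missing or unapproved language returns nothing instead of falling back to translation on the fly.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class VoicedStop:
    """One stop's narration in one language, prepared ahead of time (hypothetical)."""
    stop_id: str
    language: str
    text: str
    approved_by: Optional[str] = None  # curator who signed off, if anyone has


@dataclass
class TourLibrary:
    """In-memory store keyed by (stop_id, language); a stand-in for real storage."""
    stops: Dict[Tuple[str, str], VoicedStop] = field(default_factory=dict)

    def add(self, stop: VoicedStop) -> None:
        self.stops[(stop.stop_id, stop.language)] = stop

    def approve(self, stop_id: str, language: str, curator: str) -> None:
        self.stops[(stop_id, language)].approved_by = curator

    def playable(self, stop_id: str, language: str) -> Optional[VoicedStop]:
        """Return the narration only if a curator approved it.

        There is deliberately no fallback that translates at playback time.
        """
        stop = self.stops.get((stop_id, language))
        if stop is None or stop.approved_by is None:
            return None
        return stop
```

In use, adding a Spanish narration does nothing for the visitor until `approve` is called. Until then `playable` returns `None`, and the caller has to decide what to offer instead.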

We are not going to ship richer broadcast — longer scripts, sound design, music beds. We could. The voice synthesis is good enough. We are going to stay short and let the visitor's questions do the deepening. The unit of value is the conversation, not the cinematic two-minute stop.

These are not features-we-have-not-built-yet. They are features we are choosing not to build, because building them would make the product worse at the thing it is actually for.

The strongest objection to all this

Here is the version of the counterargument that I find genuinely difficult.

The conversational, relationship-first model I am describing has a quiet assumption baked into it: that visitors will engage. That they will ask the question, tap the prompt, follow the thread. The broadcast model does not require this assumption. It works on visitors who press play and do nothing else, which is most visitors. A great script in a great voice does the work for them. It builds something — maybe not the deep relationship a conversation would, but a real one — even when the visitor is tired, or distracted, or with kids, or just not in the mood to engage. The broadcast model is the floor. The conversational model is the ceiling.

If you optimize for the ceiling, you might lose the floor. You might build a product that is incredible for the ten percent of visitors who actually engage and worse than a traditional audio wand for the ninety percent who do not. That is a serious tradeoff, and I would be lying if I said we had it figured out.

The answer I am holding to, but am not finished with, is that the broadcast layer still has to be excellent. The default playthrough, where the visitor presses play, walks through the gallery, and does not ask a single question, has to be a good audio tour. Not a stripped-down version. A good one. The conversational layer is what is added, not what replaces it. If the broadcast collapses, the product collapses with it. We have to build both, well, and resist the temptation to let either eat the other.

I think the platforms that survive the next ten years in this category are going to be the ones that hold both standards at once, broadcast that is genuinely good for the passive visitor and conversation that is genuinely good for the active one, without using either as an excuse for the other being mediocre. That is harder than picking one. It is also the only honest answer to the objection.

The line I want to leave you with

The visits people remember are the ones where, for thirty seconds or three minutes, an object stopped being a thing in a room and started being a thing in their head. Nothing else a museum does matters more than producing that moment, and nothing the institution can buy or build is the moment itself. The audio is a tool. The script is a tool. The platform is a tool. The relationship is the product. The institutions that understand this — and the vendors that build for them — will make the next era of museum interpretation. The ones still selling deliverables will be fine for a while, and then they will not be.

The audio guide is not the product. It never was. We just sold it that way because it was the part we could put a price tag on.

About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the economics and craft of museum interpretation. Reach him at eric@convo.app or on LinkedIn.