How to choose museum audio guide software: an evaluation checklist.

KEY TAKEAWAYS

The category has fragmented faster than buyers can keep up. A 2024 vendor shortlist is unrecognizable in 2026; the platforms that were obvious leaders eighteen months ago now lag on grounding, multilingual review, or update latency.
Evaluate vendors on nine dimensions, not five: grounding architecture, language coverage and review workflow, editorial control, update latency, analytics depth, accessibility, data handling, exit and migration story, and support model.
The single most important dimension is grounding architecture — whether visitor answers are anchored to your reference materials and whether the platform refuses to answer when it cannot ground. Everything else is recoverable; an ungrounded conversational layer is not.
The two kinds of analytics are different products. Broadcast metrics (starts, completion, top stops) tell you what visitors did. Conversation metrics (clustered question themes) tell you what visitors are thinking. Most legacy vendors only ship the first; most AI-era vendors ship both.
A real evaluation costs roughly 20–40 hours of staff time spread over six to eight weeks. Skipping that work to "pick the cheapest" is the most expensive procurement decision in the category. The RFP template in the next piece of this pillar is the artifact that turns the checklist below into a scored shortlist.

The hard part of choosing museum audio guide software in 2026 isn't finding vendors. The category has gone from four serious platforms in 2020 to more than fifteen, and most of them will give you a polished demo. The hard part is knowing what to evaluate. Every vendor's deck claims multilingual, AI, accessibility, and analytics. The differences live a layer down — in the grounding architecture behind a conversational answer, the review workflow behind a Korean translation, the latency between fixing a script and a visitor hearing the fix.

The checklist below is the one I'd want a director or head of visitor experience to use against us and against every vendor on the shortlist. Where Convo leads on a dimension, I say so; where another platform leads, I say that too. The point of the document isn't to win — it's to give you the questions that make the answer obvious.

For the RFP template that turns this checklist into a scored evaluation, see the audio guide RFP template. For the broader category map, the AI audio guides pillar is the prerequisite read.

What does a real museum audio guide evaluation cover?

A serious evaluation of museum audio guide software covers nine dimensions, not the five most decks lead with. The five obvious ones (price, voice quality, languages, hardware model, and analytics) are necessary but not sufficient. They miss the four dimensions that actually predict whether a vendor will still serve you well in year three: grounding architecture (does the conversational layer make things up), editorial control (does the platform respect curators or work around them), update latency (how long from a script change to visitors hearing the new audio), and the exit story (what happens when you leave). Skip those four and you'll choose a platform that demos brilliantly and disappoints in production. The sections below also cover accessibility, data handling, and the support model.

The rest of this piece is one section per dimension, with the question to ask, the answer pattern to listen for, and a brief note on which vendors currently lead.

Dimension 1: Grounding architecture

Ask vendors how the conversational layer is grounded, and what it does when it can't ground an answer. This is the single most important question in the category and the one most decks elide.

A modern AI audio guide does two things: it narrates a curator-approved script (low risk — the visitor hears exactly what the curator approved), and it answers visitor follow-up questions in chat or voice (high risk — the answer is generated on the fly). The grounding architecture is the discipline that anchors those follow-up answers to the curator's reference materials and refuses to answer when it can't. A platform without that discipline will confidently hallucinate provenance, dates, artists, and attributions to your visitors.

What to listen for: the vendor names a retrieval mechanism (the reference files the visitor's question is matched against) and describes a refusal behavior ("when the guide can't ground an answer in your sources, it says so rather than inventing one"). Vague answers ("we use the latest models," "our AI is trained on museum content") are red flags. The platforms currently leading on grounding are the ones built around reference-material-first authoring — Convo is one; some early-stage purpose-built platforms are others. Legacy vendors retrofitting AI onto handset content are not yet there.
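To make "retrieval plus refusal" concrete, here is a minimal sketch of the behavior to listen for. It is not how Convo or any other vendor implements grounding: the reference passages, the keyword-overlap matching, and the `min_overlap` threshold are all assumptions chosen for illustration, and a real platform would more likely retrieve with embeddings and generate a phrased answer from the matched passage.

```python
import re

# Hypothetical reference passages a curator might upload; not real catalogue data.
REFERENCES = {
    "stop-12": "The altarpiece was painted in oil on oak panel around 1480.",
    "stop-13": "The bronze figure was cast using the lost-wax technique.",
}

REFUSAL = "I can't find that in the museum's reference materials."
STOPWORDS = {"the", "a", "an", "was", "is", "in", "on", "of",
             "what", "who", "when", "how", "it", "this", "with"}


def tokens(text: str) -> set[str]:
    """Lowercase word set with common filler words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def answer(question: str, min_overlap: int = 2) -> dict:
    """Return the best-matching passage, or a refusal if nothing matches well enough."""
    q = tokens(question)
    best_id, best_score = None, 0
    for ref_id, passage in REFERENCES.items():
        score = len(q & tokens(passage))
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_score < min_overlap:
        # Refuse rather than invent: this is the behavior to ask vendors about.
        return {"grounded": False, "text": REFUSAL, "source": None}
    return {"grounded": True, "text": REFERENCES[best_id], "source": best_id}
```

The part worth noticing is the explicit refusal branch and the `source` field. A vendor with real grounding should be able to show you the equivalent of both: which reference a given answer came from, and what the visitor hears when there is none.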

Dimension 2: Language coverage and review workflow

Ask not just how many languages the platform supports, but how a curator reviews a translation they don't speak. Coverage is the marketing number; review workflow is the operational reality.

Most platforms now claim ten or more languages. The question that separates them is what happens when your curator approves a Spanish tour and the Mandarin version needs review but no one on staff reads Mandarin. The good answer is some combination of: a side-by-side back-translation into English so the curator can compare meaning, a glossary of institution-specific terms (artist names, technique vocabulary) that locks across all languages, and a workflow that flags substantive divergences for human review. The bad answer is "we use neural translation and it's very accurate," which leaves the curator unable to sign off responsibly.
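The locked-glossary idea can be sketched in a few lines. This is an illustrative check under stated assumptions, not any vendor's review tooling: the glossary entries and the function name are hypothetical, and a plain substring match will miss inflected forms that a real workflow would need to handle.

```python
# Hypothetical glossary: each source term maps to its approved rendering per language.
GLOSSARY = {
    "Artemisia Gentileschi": {"es": "Artemisia Gentileschi", "fr": "Artemisia Gentileschi"},
    "chiaroscuro": {"es": "claroscuro", "fr": "clair-obscur"},
}


def glossary_violations(source: str, translation: str, lang: str) -> list[str]:
    """List glossary terms used in the source whose approved rendering is missing from the translation."""
    missing = []
    for term, renderings in GLOSSARY.items():
        approved = renderings.get(lang)
        if approved is None:
            continue  # no approved rendering recorded for this language
        if term.lower() in source.lower() and approved.lower() not in translation.lower():
            missing.append(term)
    return missing
```

A check like this gives a curator who doesn't read the target language a concrete list of terms to query, which is the kind of flag-for-human-review signal described above.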

Convo ships ten languages — English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic — with one-source regeneration; the review tooling for non-English languages is an active investment area. Other AI-narrated platforms differ on how many languages they offer, how those translations get reviewed, and whether translation work is human-supplemented; the legacy studio-and-handset vendors handle review well inside the studio model but charge per language. Ask each vendor for a screenshot of the review-a-Mandarin-translation-as-a-non-Mandarin-speaker workflow; the answer is informative.

Dimension 3: Editorial control and curator override

Ask whether a curator can edit a single sentence of a generated script, in seconds, without going through support. This is the line between a platform that respects curatorial expertise and one that works around it.

The principle behind the better platforms in this category is that curators stay in charge: AI drafts, humans decide. That sounds obvious in a demo. The places it breaks down in production are small: a curator notices a phrasing they want to change in the German version; can they fix it in the admin, or do they have to file a ticket? A factual correction needs to land across all ten languages; does it cascade or do they edit ten times? An exhibition lead wants the tour for a new wing to sound different from the permanent collection's tour; can they configure a voice register, or is the whole institution one voice?

The good answer: direct line-by-line script editing, a natural-language draft assistant ("make this paragraph shorter and less formal") that takes instructions but doesn't auto-apply, per-tour voice configuration, and cascading edits across languages from a single source. The bad answer: any workflow that requires a support ticket for a content change. Most modern AI-narrated platforms now offer direct editing; the differences are in granularity and how the AI assist behaves — our compare pages cover the specifics where we've done the side-by-sides. Legacy handset vendors do not — content changes are still a support process.

Dimension 4: Update latency

Ask how long it takes from a script edit to a visitor hearing the new audio, in seconds. Then ask the same question for the Mandarin version, and for a tour that's currently live to 400 visitors.

This dimension separates AI-era platforms from everything else, and separates the well-built AI platforms from the early ones. The legacy answer is "we can update content; it's a scheduled re-record" — meaning days to weeks. The early-AI answer is "instant" — meaning the script changes but the audio cache takes hours to invalidate, or the change applies only to new sessions, or only English re-voices and the other languages lag.

The right answer: a curator edits a script line; on save, the audio for that line is regenerated within seconds; the change is live to the next visitor who reaches that stop, across every language the tour supports. The audit log records what changed, when, and by whom. The reason this matters is the principle we keep coming back to: a correction should not be a project. If your platform makes fixing an attribution error feel like a deployment, your curators will let errors stand.

The platforms that get this right have built around an event-driven re-render rather than a content management system bolted to a CDN. Ask the vendor to demo the full edit-to-live loop in real time. If they can't, you've learned something.
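As a rough picture of what "event-driven re-render" means, the sketch below treats the save itself as the event: one call updates the script in every language, bumps the audio version for that line, and appends an audit entry. Everything here is hypothetical (the `Tour` class, its fields, the placeholder `translate` step); it shows the shape of the loop, not a real platform's pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Tour:
    languages: list[str]                             # first entry is the source language
    scripts: dict = field(default_factory=dict)      # (stop, lang) -> script text
    audio: dict = field(default_factory=dict)        # (stop, lang) -> audio version number
    audit_log: list = field(default_factory=list)

    def translate(self, text: str, lang: str) -> str:
        # Placeholder: a real platform would call its translation step here.
        return f"[{lang}] {text}"

    def save_line(self, stop: str, text: str, editor: str) -> None:
        """Saving is the event: every language re-renders before the call returns."""
        for lang in self.languages:
            key = (stop, lang)
            is_source = lang == self.languages[0]
            self.scripts[key] = text if is_source else self.translate(text, lang)
            self.audio[key] = self.audio.get(key, 0) + 1  # stands in for re-voicing the line
        # Record what changed, when, and by whom.
        self.audit_log.append({
            "stop": stop,
            "editor": editor,
            "at": datetime.now(timezone.utc).isoformat(),
            "text": text,
        })
```

The design point is that there is no separate publish step or cache to wait on: the next visitor to reach the stop reads whatever `scripts` and `audio` hold after the save, in whichever language they chose.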

Dimension 5: Analytics depth — broadcast versus conversation metrics

Ask whether the vendor reports two distinct things: what visitors did, and what visitors asked. Most still only report the first.

Broadcast metrics are the legacy analytics product: tour starts, completion rate, top stops, dwell time per stop, drop-off curve. Every serious vendor in the category has these now, and they answer a real question — what's working in this tour and which stops should we improve.

Conversation metrics are the new product, and they're only possible because the tour is interactive. They report what visitors actually asked across the museum — clustered into themes (material and technique, provenance, symbolism, biography, conservation) — by gallery, by object, by language, by visitor segment. This is a different kind of insight: what are visitors actually thinking about, and what does that tell us about the curatorial story we're telling. It is, in the literal sense, an answer to a question no audio guide has ever been able to answer before, because no audio guide was ever interactive. According to the American Alliance of Museums reporting on 2024–2025 museum trends, this kind of audience-curiosity data is one of the strongest emerging signals directors use to plan programming.

Ask each vendor to show you both. If they only ship broadcast metrics, they're behind. If they ship conversation metrics but can't show clustered themes and language breakdowns, they're close but not done. The vendors leading here are the conversational-platform-native ones; the legacy vendors don't have the data to ship this product even if they wanted to.
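To show what a "themes by language" report boils down to, here is a deliberately simple sketch. It tags each question with a theme by keyword and counts (theme, language) pairs. The keyword lists and field names are assumptions, and keyword matching is only a stand-in: a production system would more plausibly cluster questions with embeddings.

```python
from collections import Counter

# Hypothetical keyword lists per theme, for illustration only.
THEMES = {
    "material and technique": {"paint", "made", "technique", "material"},
    "provenance": {"owned", "acquired", "provenance", "donated"},
    "biography": {"artist", "born", "life", "died"},
}


def theme_of(question: str) -> str:
    """Return the first theme whose keywords appear in the question, else 'other'."""
    words = set(question.lower().replace("?", "").split())
    for theme, keywords in THEMES.items():
        if words & keywords:
            return theme
    return "other"


def theme_counts(log: list[dict]) -> Counter:
    """Count (theme, language) pairs across a log of visitor questions."""
    return Counter((theme_of(q["text"]), q["lang"]) for q in log)
```

Here `lang` is the visitor's chosen tour language, and the question text is assumed to be normalized to English before tagging. When a vendor demos conversation metrics, the output should look like a richer version of this counter, sliced by gallery and object as well.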

Dimension 6: Accessibility

Ask whether the platform meets WCAG 2.2 AA and how it handles audio description, captions, screen readers, and font sizing. Then ask which of those work on day one without a separate workstream.

A phone-based platform inherits the visitor's device accessibility stack — VoiceOver on iOS, TalkBack on Android, dynamic type, system captions. That's a structural advantage over a vendor-shipped handset, where the device's accessibility is whatever the hardware OEM built. But "we run in a browser" isn't the same as "we're accessible." The platform still has to ship a player that respects screen-reader semantics, expose transcripts for every audio file, support captions for the conversational answers, and meet contrast and focus requirements on its own UI.
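Contrast is one of the few requirements in that list you can check yourself in a minute. The sketch below follows the WCAG definitions of relative luminance and contrast ratio and applies the AA threshold of 4.5:1 for normal-size text. The function names are mine, and it assumes colors written as six-digit hex strings.

```python
def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an sRGB color written as '#rrggbb', per the WCAG definition."""
    digits = hex_color.lstrip("#")
    channels = [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve for each channel.
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4 for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) up to about 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_text(foreground: str, background: str) -> bool:
    """WCAG AA asks for at least 4.5:1 on normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

Pull the text and background colors from a vendor's player with a browser inspector and run them through this. It covers only one success criterion, so treat a pass as a spot check rather than evidence of conformance.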

The Web Content Accessibility Guidelines 2.2 (W3C, October 2023) is the published bar. The 2024 final rule under ADA Title II requires state and local government web content to meet WCAG 2.1 AA by April 2026 for public entities serving a population of 50,000 or more and by April 2027 for smaller ones, which is relevant for any publicly funded museum. Ask vendors for their plan for audio description as a first-class feature, not a transcript-as-afterthought. The Accessibility pillar covers this dimension in more depth.

The serious modern web-based platforms in the category, including Convo, all ship credible accessibility stories. The legacy handset vendors are mixed — some are excellent (handsets designed for accessibility from the start), some are decade-stale. Ask for the artifact; don't accept the assurance.

Dimension 7: Data handling and privacy

Ask three questions: who owns the reference materials, are they used to train models, and what visitor data is collected and retained. Then ask for the answers in writing.

The reference-material question is the load-bearing one for the institution. The good answer: your materials are yours, they are not used to train any model (the vendor's or a sub-processor's), they are not shared with other institutions, and you can export or delete them on request. The bad answer involves any qualifier about "improving our service" that opens a training-data door.

The visitor-data question is the load-bearing one for legal and IT. A phone-based platform that supports conversational Q&A collects, at minimum: anonymous session data, the questions visitors ask, the answers given, and language preference. The good answer: minimal collection by default, no personal identifiers without explicit opt-in, GDPR/CCPA compliant, configurable retention windows, and a clear data processing agreement (DPA) for institutions in regulated jurisdictions.

Vendors with European institutional customers usually have the cleanest answers here because GDPR forced the work; vendors who've only sold in the US sometimes haven't done it yet. Ask for the DPA before signing, not after.
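"Configurable retention windows" is easy to say and worth pinning down. In its simplest form it means something like the sketch below: session records older than a number of days you choose are dropped. The record fields and function name are assumptions for illustration; they mirror the minimum data set described above (anonymous session, language, questions asked) and no vendor's actual schema.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(sessions: list[dict], retention_days: int, now: datetime) -> list[dict]:
    """Keep only sessions that started within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [s for s in sessions if s["started_at"] >= cutoff]
```

When you ask for the DPA, ask what the equivalent of `retention_days` is, who can change it, and whether deletion also reaches backups and sub-processors.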

Dimension 8: Exit and migration story

Ask what happens to your tours, your reference materials, and your visitor analytics if you leave. A vendor that can't answer cleanly is a vendor you should think twice about.

The exit story is the single most undervalued dimension in audio guide procurement. Buyers assume they'll never leave; they leave more often than the industry admits, usually because pricing changes, the vendor pivots, or the platform falls behind. The clean answer: scripts export in a standard format (Markdown or plain text, plus metadata as JSON or CSV); audio files export as MP3 or WAV per stop, per language; visitor analytics export as CSV; reference materials are returned on request; the QR codes are yours and you can repoint them at any successor platform. The unclean answer: any version of "your tours live in our system."

This is also the question that exposes lock-in disguised as features. A vendor whose authoring tools only work inside their platform, whose audio cannot be exported, or whose analytics live behind their dashboard has built a switching-cost moat that benefits them and costs you. Ask for an actual export. The honest map of the category is that the AI-era platforms are mostly on the right side of this, and several legacy vendors are not.
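For a sense of what an "actual export" should contain, here is a minimal sketch that assembles the text parts of a vendor-neutral bundle: one Markdown file per stop and language, a JSON metadata file, and a CSV of analytics. The file layout, field names, and `build_export` function are hypothetical, and audio files are left out because they are binary.

```python
import csv
import io
import json


def build_export(stops: list[dict], analytics: list[dict]) -> dict[str, str]:
    """Return filename -> file contents for a vendor-neutral export bundle."""
    files = {}
    for stop in stops:
        # One Markdown script per stop, per language.
        name = f"scripts/{stop['id']}.{stop['lang']}.md"
        files[name] = f"# {stop['title']}\n\n{stop['script']}\n"
    # Metadata as JSON, so a successor platform can rebuild the tour structure.
    meta = [{k: s[k] for k in ("id", "lang", "title")} for s in stops]
    files["metadata.json"] = json.dumps(meta, indent=2, ensure_ascii=False)
    # Broadcast analytics as CSV.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["stop_id", "starts", "completions"])
    writer.writeheader()
    writer.writerows(analytics)
    files["analytics.csv"] = buf.getvalue()
    return files
```

If the bundle a vendor hands you can be opened with a text editor and a spreadsheet, you can leave. If it needs their software to read, you can't.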

Dimension 9: Support model

Ask what response times look like, who answers, and what's included versus billable. Then ask the reference customer the same question.

The support model is mostly invisible during evaluation and load-bearing in year two. A platform that takes three business days to respond to a content question is unusable for an institution that updates exhibits monthly. A platform that charges per-incident for content help has structurally misaligned its incentives with yours.

Convo's current support model is direct email to the team, with response times measured in hours during business days. We don't yet offer live chat, in-app support, a public knowledge base, or formal SLAs — the company is small and that's the honest scope of what we provide. As we grow, those gaps will close; for now the trade-off is the founders are in the loop on every support thread, which has its own value. Larger and more mature platforms in the category ship more formal support with documented SLAs and knowledge bases. Match the support model to your institutional context: a museum with a dedicated audio guide manager can absorb a lighter-touch vendor; a museum where the audio guide is one of twenty things on the visitor-experience lead's plate cannot.

Ask references three questions: how long does a typical support ticket take, what was the worst support experience you had in the last year, and would you choose this vendor again.

A scoring rubric for the nine dimensions

Score each vendor 1–5 on each dimension, weighted as follows. The weights reflect what we've seen actually predict satisfaction in year two, not what shows best in demos:

Dimension	Weight	What 5/5 looks like
Grounding architecture	20%	Named retrieval mechanism, refusal behavior, audit log
Language coverage + review	15%	10+ languages, side-by-side review, locked glossary
Editorial control	15%	Line-by-line edit, AI assist on request, no tickets
Update latency	10%	Edit-to-live in seconds, across all languages
Analytics depth	10%	Broadcast + conversation metrics, clustered themes
Accessibility	10%	WCAG 2.2 AA, audio description first-class
Data handling	8%	Materials not used for training, DPA available, clean retention
Exit + migration	7%	Standard-format export of scripts, audio, analytics
Support model	5%	Hours-not-days response, references confirm

A vendor below 70/100 should not be on the shortlist. A vendor above 85 is a credible finalist. The dimensions that most differentiate inside the 75–90 band are grounding, update latency, and exit — the three that are hardest to fix later.
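The rubric is simple enough to compute in a spreadsheet, but for anyone who prefers a script, a sketch follows. The weights are the ones in the table. Two things are my assumptions: the conversion (each 1–5 score is scaled by its weight so that all fives total 100) and the label for the middle band, since the text only defines the cutoffs at 70 and 85. The short dimension keys are also just illustrative names.

```python
# Weights from the rubric table, in percent; they sum to 100.
WEIGHTS = {
    "grounding": 20, "language": 15, "editorial": 15, "latency": 10,
    "analytics": 10, "accessibility": 10, "data": 8, "exit": 7, "support": 5,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Convert a 1-5 score per dimension into a weighted total out of 100."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every dimension exactly once")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / 5


def band(total: float) -> str:
    """Apply the cutoffs: below 70 is out, above 85 is a finalist."""
    if total < 70:
        return "off the shortlist"
    if total > 85:
        return "credible finalist"
    return "shortlist"  # assumed label for the 70-85 middle band
```

One consequence of this scaling is that a vendor scoring 1 on everything still totals 20, so the effective range is 20 to 100. A straight run of 3s lands at 60, below the cutoff, which matches the intent that "adequate across the board" is not enough.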

Where this checklist doesn't fit

A few honest cases where this framework is the wrong tool. If you're producing a single tour for a named-voice celebrity narration, dimensions 4 (update latency) and 5 (conversation metrics) are mostly irrelevant — you're running the legacy production model on purpose, and the AI vs traditional comparison is where to start. If you're mid-contract on a hardware fleet with eighteen months left, this checklist applies to your next procurement, not this one. And if your institution truly only needs one language and one tour for a permanent collection that won't change in a decade, the dimensions collapse to four (cost, voice quality, hardware, accessibility) and a much simpler RFP works.

For most institutions evaluating in 2026, though, the nine dimensions above are the ones that separate vendors who'll still be serving you well in 2029 from the ones who won't.

FAQ

How long should an audio guide software evaluation take?

Six to eight weeks, with 20–40 hours of staff time. A shorter evaluation either skips dimensions that matter (grounding, exit, support) or relies on demos rather than hands-on trial. The RFP template in this pillar is designed to fit that timeline.

Who should be on the evaluation team?

Three roles, minimum: the curator or education lead who'll author tours (champion), the visitor experience or operations lead who'll run the program (buyer), and an IT or legal reviewer for data handling and accessibility compliance (influencer). Larger institutions add procurement, accessibility, and DEI roles; smaller ones combine the buyer and champion.

Do we need a formal RFP?

For purchases above roughly $25,000 a year, a written RFP is worth the effort: it forces vendors to answer the dimensions that matter and gives you a paper trail for the board. Below that, a structured evaluation against this checklist with three vendors is usually enough. The RFP template in this pillar works for both.

How much should price matter?

Less than buyers expect. Price differences across credible AI-era platforms are smaller than the operational differences in update latency, grounding, and editorial control. A platform that's 20% cheaper but takes a week to update a script will cost more in staff time inside a year than the price difference saves.

What is the most common mistake buyers make?

Choosing on demo polish rather than year-two operational fit. The demos are all good now. The differences live in grounding, update latency, exit terms, and support, none of which are visible until you're three months in. The checklist above is designed to make those differences visible before signing.

Where does Convo lead, and where does it lag?

Convo leads on grounding architecture, update latency, editorial control, and conversation analytics. Convo lags on formal SLAs, public knowledge base, and depth of non-English review tooling; those are active investments. The honest read is that we score in the high 80s overall and we'd rather you know that than over-claim.

How often should we re-evaluate our vendor?

At every contract renewal, and any time the vendor pivots, changes pricing materially, or misses on two or more of the nine dimensions over a quarter. The category is moving fast enough that a vendor who scored 88 in 2024 may score 72 in 2026 without anything seeming to change.

What to do next

If you're at the start of a procurement, the right next move is to walk this checklist against three vendors — including Convo if it's a credible fit — and produce a scored shortlist. The audio guide RFP template in this pillar formats the questions into a document you can send. The vendor comparison hub covers head-to-head reads on the platforms we get shortlisted alongside, like Convo vs Bloomberg Connects.

Our pricing is on one page, and the 30-day pilot is free — running one real tour against the checklist above usually answers more questions than another vendor call. Reach me at eric@convo.app if you want a hand interpreting the answers.

About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the economics and evaluation of museum interpretation from inside the category — drawing on RFP data, discovery calls with curators and directors, and the operational reality of running a conversational tour platform. Reach him at eric@convo.app or on LinkedIn.