A tour your visitors can talk to.
The biggest shift in interpretation isn’t the audio; it’s that visitors now expect to ask. Convo’s conversational layer lets a visitor pause the tour at any stop and put a question to the guide by voice, by text, or by holding their phone up to the object. The answer comes back grounded in the curator’s source materials, in the visitor’s language. When the guide can’t ground an answer, it says so.
What the visitor actually does.
A visitor walks up to a wall card, sees a QR code, and scans it with the camera app. The tour opens in their browser, with no download and no account. They listen to a stop the way they would have listened to a borrowed headset. So far, nothing about the experience asks anything of them that audio guides haven’t always asked.
The change is what they can do next. At any stop, the visitor can tap a button and ask the guide a question. Why is his hand carved like that. What is this material. Is this the artist’s late work or his early work. Where is this person buried. Can you say that last part again, slower. The tour pauses, the guide answers, and the visitor chooses to keep going or to ask something else.
That shift, from listening to asking, is small in description and large in feeling. A visitor who can put a question to the room they’re standing in is in a different relationship to it than a visitor who is being talked at. The audio is no longer the product; the conversation is. (We’ve written about this idea separately in “The audio guide is not the product.”)
The visitor doesn’t have to ask anything for the tour to work. Most don’t. The ones who do tend to ask more than once, and tend to stay in the gallery longer. It’s the optional layer that changes the character of the whole visit — for the visitor who uses it and, quietly, for the visitor who knows it’s there and chooses not to.
The mechanical detail worth knowing is that the conversational layer is anchored to the stop the visitor is on. If the visitor is looking at object seven and asks a question, the guide knows the question is about object seven, and prefers sources that relate to it. It is not a free-floating chatbot grafted onto a tour. It is the same tour, asked a question, at a known location in the gallery.
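One way to picture that anchoring is as a ranking step that favors passages tied to the current stop. What follows is a minimal sketch under stated assumptions, not Convo's implementation: the Passage type, the word-overlap scoring, and the stop_boost value are all hypothetical stand-ins for whatever retrieval the product actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """Hypothetical unit of curator-uploaded source text."""
    text: str
    stop_ids: frozenset  # stops this passage relates to (may be empty)


def words(s):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,?!;:\"'").lower() for w in s.split()} - {""}


def rank_for_stop(question, current_stop, passages, stop_boost=2.0):
    """Rank passages by word overlap with the question,
    boosting passages tied to the stop the visitor is on."""
    q = words(question)
    scored = []
    for p in passages:
        overlap = len(q & words(p.text))
        if overlap == 0:
            continue  # no shared vocabulary, so not a candidate
        # Assumption: a simple multiplicative preference for the current stop.
        boost = stop_boost if current_stop in p.stop_ids else 1.0
        scored.append((overlap * boost, p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]
```

The same question asked at two different stops surfaces different passages first, which is the behavior described above: the location is part of the question.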
Grounded in your sources, not the open internet.
Every Convo tour is built on top of a reference set the curator uploaded: wall cards, catalog entries, exhibition essays, scholarly articles, the institution’s own writing about its own collection. When a visitor asks a question, the guide searches that reference set for relevant passages, composes an answer from them, and presents it in the same voice as the rest of the tour.
When the reference set doesn’t contain a basis for the answer, the guide declines. It does not improvise. It does not reach to the open internet. It does not pull what it learned in pre-training and frame it as your institution’s answer. The default behavior is refusal, and refusal is the right behavior. A confident invention from a museum tour is worse than an honest miss.
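The refuse-by-default rule can be sketched in a few lines. This is an illustration under assumptions rather than the product's code: the relevance scores are taken as given, the min_score threshold and the wording of the decline message are hypothetical, and joining passages stands in for the step where the guide composes an answer in the tour's voice.

```python
from dataclasses import dataclass
from typing import List

# Illustrative wording only; the real decline message is not specified here.
DECLINE_MESSAGE = "I don't have a source for that."


@dataclass
class GuideReply:
    grounded: bool         # True only if at least one source supports the answer
    text: str
    source_ids: List[str]  # which curator sources the answer drew on


def reply_from_sources(scored_passages, min_score=0.5, max_passages=3):
    """scored_passages: list of (source_id, passage_text, relevance_score).

    Returns a grounded reply built only from passages at or above
    min_score, or a decline when nothing clears the threshold.
    """
    supported = [sp for sp in scored_passages if sp[2] >= min_score]
    if not supported:
        # Default behavior: no basis in the reference set, so decline.
        return GuideReply(False, DECLINE_MESSAGE, [])
    supported.sort(key=lambda sp: sp[2], reverse=True)
    top = supported[:max_passages]
    # Stand-in for answer composition: concatenate the supporting passages.
    text = " ".join(passage for _, passage, _ in top)
    return GuideReply(True, text, [source_id for source_id, _, _ in top])
```

The design point is that there is no third branch. The function either answers from passages it can name or declines; it never falls through to anything outside the reference set.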
This is the procurement-grade trust argument and we lean on it everywhere we can. The guide is bounded by the same canon the curators built it on. New canon ships when you ship it. Nothing else.
In practice this means two things. First, the quality of the Q&A tracks the quality of the source set you provide — thinner sources produce more declines, richer sources produce richer answers. The tool rewards curatorial care. Second, the visitor’s experience of being declined is not a system failure — it’s the system working. We’ve written more about the principle behind that in Authenticity and AI.
There is one more thing worth being explicit about. The model behind the Q&A has, of course, seen a great deal of the open web in pre-training. We don’t pretend it hasn’t. What we do is constrain the answer it’s allowed to give: the guide is instructed to answer from the curator’s sources only, and to refuse rather than to fall back on its general knowledge. This is the difference between a chatbot bolted onto a museum and a tour that happens to be conversational. The boundary is the institution’s canon.
Voice, text, and image — three ways to ask.
Voice is the obvious one. A visitor taps the microphone, speaks the question in whatever language they’re most comfortable in, and the guide answers aloud in the same language. The exchange feels like talking to a knowledgeable person: there’s no typing, no menu, no awkward pause while the visitor tries to phrase a search. We shipped voice on top of OpenAI’s Realtime API over WebRTC in February 2026, which is what makes the latency feel like conversation rather than like issuing commands at a command line.
Text is the unobtrusive one. Some galleries are quiet by design. Some visitors don’t want to talk to their phone in front of a stranger. Text Q&A is available at every stop, with a soft keyboard, in any of the ten languages. The same grounding rules apply. The same logs are written.
Image is the new one. A visitor can hold their phone camera up to the object in front of them and ask, “what is this”, or “tell me about this detail”. The image goes into the model alongside the question, and the guide answers grounded in the same reference set — not from open-web knowledge of similar-looking objects. This is especially useful for collections where objects don’t carry obvious labels: archaeological fragments, geology, technical apparatus, wide-walled installations where the relationship between object and card isn’t one-to-one.
All three modes share the same source set, the same refusal behavior, the same multilingual surface. Asking by camera in Japanese gets an answer in Japanese, grounded in the same English catalog the curator uploaded. The language layer is the same one that produces the linear narration; the visitor doesn’t cross any seams.
In practice, visitors choose by context. Voice in a quiet room is awkward; text is fine. Voice with headphones on a busy floor is effortless. Image is the mode we see used least often by count and talked about most often in feedback. It’s the moment the tour stops being a list of stops and starts being a conversation about the actual object in front of the visitor. We expect image input to account for a growing share of questions as the behavior becomes more familiar.
What curators see afterward.
Every visitor interaction is logged in the admin. A curator can open the tour, go to any stop, and read the actual conversations that happened there — the question the visitor asked, the answer the guide produced, and the time and language of the exchange. Not a sampled subset; the full set.
This is the part curators tell us is the most surprising. Not because they wanted surveillance, but because they wanted feedback, and audio guides have never given it to them. You ship a tour and you don’t know which line landed. You don’t know what the visitor was actually wondering. You don’t know which stops produced more questions than the next room. Now you do.
Declined questions are particularly useful. When the guide says “I don’t have a source for that,” that question still gets logged, and the curator can see, across a week or a month, what the canon missed. Sometimes the answer is to upload one more essay. Sometimes it’s to rewrite a stop. Sometimes it’s to write a new piece of wall text. The Q&A log becomes a feedback channel back into the institution’s own interpretation.
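To make the feedback loop concrete, here is a small sketch of how a log like this could be read for gaps. The record fields mirror what the log is described as holding (question, answer, time, language, stop); the Exchange type, the declined flag, and both helper functions are assumptions for illustration, not the admin's actual data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Exchange:
    """Hypothetical shape of one logged visitor interaction."""
    stop_id: int
    question: str
    answer: str
    language: str
    asked_at: datetime
    declined: bool  # assumption: the log marks answers the guide could not ground


def declines_by_stop(log):
    """Return (stop_id, decline_count) pairs, most declines first."""
    counts = Counter(e.stop_id for e in log if e.declined)
    return counts.most_common()


def declined_questions(log, stop_id):
    """List the questions the guide could not ground at one stop."""
    return [e.question for e in log if e.declined and e.stop_id == stop_id]
```

A curator reviewing a month of exchanges might start with declines_by_stop to find the stop where the canon falls short most often, then read declined_questions for that stop to decide whether the fix is another essay, a rewritten stop, or new wall text.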
The aggregate view (themes across many visitors, gaps across the collection, takeaway memos for the director) lives in Q&A Insights, which has its own page.
Where the conversational layer doesn’t fit.
Two cases. The first is the tour that is a broadcast by design. Some of the best audio we’ve ever heard in a museum is a single named voice telling a single story: a curator’s essay-as-audio, a writer walking you through their own response to a show, a sound work that is itself the art. Those tours aren’t supposed to be interrupted. They’re supposed to be listened to. Convo’s conversational layer is optional at every stop, and on a broadcast tour the right answer may be to leave it off and let the piece speak.
The second is the institution whose curatorial position is that the visitor should engage with the object and the wall text and not ask the audio anything. That position is defensible — there is a real argument that interpretation should set the frame and then step back. If that’s your house style, the audio-guide-as-broadcast model is closer to it than a conversational tour is, and we’ll say so on a call. We are not the right answer for every collection.
For everyone else — and that’s most institutions we talk to — the ability to ask is what makes a tour feel like a relationship instead of a recording. It’s also the lowest-stakes way to test whether the wall text you already have answers the questions visitors are actually showing up with. Which is part of why we keep the category guide honest about what the conversational layer changes and what it doesn’t.