VISITOR Q&A

A tour your visitors can talk to.

The biggest shift in interpretation isn’t the audio; it’s that visitors now expect to ask. Convo’s conversational layer lets a visitor pause the tour at any stop and put a question to the guide by voice, by text, or by holding their phone up to the object. The answer comes back grounded in the curator’s source materials, in the visitor’s language. When the guide can’t ground an answer, it says so.

THE VISITOR FLOW

What the visitor actually does.

A visitor walks up to a wall card, sees a QR code, and scans it with the camera app. The tour opens in their browser, with no download and no account. They listen to a stop the way they would have listened to a borrowed headset. So far, nothing about the experience asks anything of them that audio guides haven’t always asked.

The change is what they can do next. At any stop, the visitor can tap a button and ask the guide a question. Why is his hand carved like that. What is this material. Is this the artist’s late work or his early work. Where is this person buried. Can you say that last part again, slower. The tour pauses, the guide answers, and the visitor chooses to keep going or to ask something else.

That shift — from listening to asking — is small in description and large in feeling. A visitor who can put a question to the room they’re standing in is in a different relationship to it than a visitor who is being talked at. The audio is no longer the product; the conversation is. (We’ve written about this idea separately in The audio guide is not the product.)

The visitor doesn’t have to ask anything for the tour to work. Most don’t. The ones who do tend to ask more than once, and tend to stay in the gallery longer. It’s the optional layer that changes the character of the whole visit — for the visitor who uses it and, quietly, for the visitor who knows it’s there and chooses not to.

The mechanical detail worth knowing is that the conversational layer is anchored to the stop the visitor is on. If the visitor is looking at object seven and asks a question, the guide knows the question is about object seven, and prefers sources that relate to it. It is not a free-floating chatbot grafted onto a tour. It is the same tour, asked a question, at a known location in the gallery.
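
One way to picture that anchoring is a ranking step that favors passages tied to the current stop. This is a minimal sketch and not Convo’s implementation: the `Passage` type, the word-overlap scoring, and the `stop_boost` multiplier are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    stop_id: int  # hypothetical: the stop this source passage is attached to
    text: str


def score(passage: Passage, question: str, current_stop: int,
          stop_boost: float = 2.0) -> float:
    """Word-overlap relevance, multiplied when the passage belongs to the current stop."""
    q_words = set(question.lower().split())
    p_words = set(passage.text.lower().split())
    overlap = len(q_words & p_words)
    boost = stop_boost if passage.stop_id == current_stop else 1.0
    return overlap * boost


def rank(passages: list[Passage], question: str, current_stop: int) -> list[Passage]:
    """Return passages that share at least one word with the question, best first."""
    scored = [(score(p, question, current_stop), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for s, p in scored if s > 0]


passages = [
    Passage(3, "the marble hand is carved open"),
    Passage(7, "the bronze hand is carved in a fist"),
]
ranked = rank(passages, "why is his hand carved like that", current_stop=7)
```

Both passages match the question equally well on words alone; the boost is what puts the passage for stop seven first when the visitor is standing at stop seven.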

HOW GROUNDING WORKS

Grounded in your sources, not the open internet.

Every Convo tour is built on top of a reference set the curator uploaded: wall cards, catalog entries, exhibition essays, scholarly articles, the institution’s own writing about its own collection. When a visitor asks a question, the guide searches that reference set for relevant passages, composes an answer from them, and presents it in the same voice as the rest of the tour.

When the reference set doesn’t contain a basis for the answer, the guide declines. It does not improvise. It does not reach to the open internet. It does not pull what it learned in pre-training and frame it as your institution’s answer. The default behavior is refusal, and refusal is the right behavior. A confident invention from a museum tour is worse than an honest miss.
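
The decline-by-default behavior can be sketched as a gate in front of the answer step. The code below is an illustration under stated assumptions, not the production pipeline: `relevance`, `answer_from_sources`, and the `min_relevance` threshold are hypothetical, and a real system would pass the supporting passages to a model to compose the reply rather than returning them verbatim.

```python
import re
from dataclasses import dataclass

DECLINE_MESSAGE = "I don't have a source for that."


@dataclass(frozen=True)
class Answer:
    text: str
    declined: bool
    sources: tuple[str, ...]


def _words(text: str) -> set[str]:
    """Lowercased word set with punctuation stripped."""
    return set(re.findall(r"[a-z']+", text.lower()))


def relevance(question: str, passage: str) -> float:
    """Fraction of the question's words that also appear in the passage."""
    q = _words(question)
    return len(q & _words(passage)) / len(q) if q else 0.0


def answer_from_sources(question: str, reference_set: list[str],
                        min_relevance: float = 0.6) -> Answer:
    """Answer only from passages that clear the threshold; otherwise decline."""
    supporting = tuple(p for p in reference_set
                       if relevance(question, p) >= min_relevance)
    if not supporting:
        # No basis in the curator's sources: refuse instead of improvising.
        return Answer(DECLINE_MESSAGE, declined=True, sources=())
    # Placeholder for the composition step: return the supporting passages themselves.
    return Answer(" ".join(supporting), declined=False, sources=supporting)


reference_set = [
    "The figure's hand is carved open as a gesture of blessing.",
    "The statue is made of Carrara marble.",
]
grounded = answer_from_sources("What is the statue made of?", reference_set)
missed = answer_from_sources("Where is this person buried?", reference_set)
```

The point of the design is that the refusal path is the default branch: an answer is produced only when something in the reference set supports it.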

This is the procurement-grade trust argument and we lean on it everywhere we can. The guide is bounded by the same canon the curators built it on. New canon ships when you ship it. Nothing else.

In practice this means two things. First, the quality of the Q&A tracks the quality of the source set you provide — thinner sources produce more declines, richer sources produce richer answers. The tool rewards curatorial care. Second, the visitor’s experience of being declined is not a system failure — it’s the system working. We’ve written more about the principle behind that in Authenticity and AI.

There is one more thing worth being explicit about. The model behind the Q&A has, of course, seen a great deal of the open web in pre-training. We don’t pretend it hasn’t. What we do is constrain the answer it’s allowed to give: the guide is instructed to answer from the curator’s sources only, and to refuse rather than to fall back on its general knowledge. This is the difference between a chatbot bolted onto a museum and a tour that happens to be conversational. The boundary is the institution’s canon.

THREE WAYS TO ASK

Voice, text, and image — three ways to ask.

Voice is the obvious one. A visitor taps the microphone, speaks the question in whatever language they’re most comfortable in, and the guide answers aloud in the same language. The exchange feels like talking to a knowledgeable person: there’s no typing, no menu, no awkward pause while the visitor tries to phrase a search. Voice runs on OpenAI’s Realtime API over WebRTC, which we shipped in February 2026; that pairing is what keeps the latency feeling like conversation rather than a command line.

Text is the unobtrusive one. Some galleries are quiet by design. Some visitors don’t want to talk to their phone in front of a stranger. Text Q&A is available at every stop, with a soft keyboard, in any of the ten languages. The same grounding rules apply. The same logs are written.

Image is the new one. A visitor can hold their phone camera up to the object in front of them and ask, “what is this”, or “tell me about this detail”. The image goes into the model alongside the question, and the guide answers grounded in the same reference set — not from open-web knowledge of similar-looking objects. This is especially useful for collections where objects don’t carry obvious labels: archaeological fragments, geology, technical apparatus, wide-walled installations where the relationship between object and card isn’t one-to-one.

All three modes share the same source set, the same refusal behavior, the same multilingual surface. Asking by camera in Japanese gets an answer in Japanese, grounded in the same English catalog the curator uploaded. The language layer is the same one that produces the linear narration; the visitor doesn’t cross any seams.

In practice visitors choose by context. Voice in a quiet room is awkward; text is fine. Voice with headphones on a busy floor is effortless. Image is the mode we see used least often by count and talked about most often in feedback: it’s the moment the tour stops being a list of stops and starts being a conversation about the actual object in front of the visitor. We expect the share of questions asked by image to keep growing as the behavior becomes more familiar.

WHAT CURATORS SEE

What curators see afterward.

Every visitor interaction is logged in the admin. A curator can open the tour, go to any stop, and read the actual conversations that happened there — the question the visitor asked, the answer the guide produced, and the time and language of the exchange. Not a sampled subset; the full set.

This is the part curators tell us is the most surprising. Not because they wanted surveillance, but because they wanted feedback, and audio guides have never given it to them. You ship a tour and you don’t know which line landed. You don’t know what the visitor was actually wondering. You don’t know which stops drew more questions than others. Now you do.

Declined questions are particularly useful. When the guide says “I don’t have a source for that,” that question still gets logged, and the curator can see, across a week or a month, what the canon missed. Sometimes the answer is to upload one more essay. Sometimes it’s to rewrite a stop. Sometimes it’s to write a new piece of wall text. The Q&A log becomes a feedback channel back into the institution’s own interpretation.
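
As a rough illustration of how a curator might read that log for gaps, the sketch below counts declined questions per stop. The `LogEntry` fields mirror what the admin is described as showing (question, answer, stop, time, language); the `declined` flag and the function name are assumptions, not the actual export format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    stop_id: int
    question: str
    answer: str
    language: str
    asked_at: datetime
    declined: bool  # hypothetical flag: True when the guide had no source


def declined_by_stop(entries: list[LogEntry]) -> list[tuple[int, int]]:
    """Return (stop_id, declined_count) pairs, most-declined stop first."""
    counts = Counter(e.stop_id for e in entries if e.declined)
    return counts.most_common()


when = datetime(2026, 3, 1, 14, 30)
log = [
    LogEntry(7, "Where is this person buried?", "I don't have a source for that.", "en", when, True),
    LogEntry(3, "What is this material?", "Carrara marble.", "en", when, False),
    LogEntry(7, "Who paid for this?", "I don't have a source for that.", "es", when, True),
    LogEntry(3, "Was it ever restored?", "I don't have a source for that.", "ja", when, True),
]
gaps = declined_by_stop(log)
```

Here stop seven surfaces first with two declines, which is the kind of signal that suggests uploading another essay or rewriting that stop.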

The aggregate view (themes across many visitors, gaps across the collection, takeaway memos for the director) is Q&A Insights, and it has its own page.

WHERE IT DOESN’T FIT

Where the conversational layer doesn’t fit.

Two cases. The first is the broadcast tour by design. Some of the best audio I’ve ever heard in a museum is a single named voice telling a single story — a curator’s essay-as-audio, a writer walking you through their own response to a show, a sound work that is itself the art. Those tours aren’t supposed to be interrupted. They’re supposed to be listened to. Convo’s conversational layer is optional at every stop, and on a broadcast tour the right answer may be to leave it off and let the piece speak.

The second is the institution whose curatorial position is that the visitor should engage with the object and the wall text and not ask the audio anything. That position is defensible — there is a real argument that interpretation should set the frame and then step back. If that’s your house style, the audio-guide-as-broadcast model is closer to it than a conversational tour is, and we’ll say so on a call. We are not the right answer for every collection.

For everyone else — and that’s most institutions we talk to — the ability to ask is what makes a tour feel like a relationship instead of a recording. It’s also the lowest-stakes way to test whether the wall text you already have answers the questions visitors are actually showing up with. Which is part of why we keep the category guide honest about what the conversational layer changes and what it doesn’t.

COMMON QUESTIONS

What procurement asks about the Q&A.

The guide says so. It’s designed to decline rather than guess. The visitor sees a short message that the question isn’t covered by the materials for this stop, and is invited to keep listening or to ask something else. Curators see the declined question in the admin log, which is often the most useful signal of all: it tells you what visitors wanted to know that you didn’t plan for.

Not by design. Every visitor answer is grounded in the reference materials the curator uploaded for that tour: wall cards, catalog entries, exhibition essays, scholarly articles, whatever you provided. The model is instructed to refuse rather than to fill in plausible-sounding detail. No system is perfect, but the failure mode we optimize for is a graceful decline, not a confident invention.

All ten Convo languages: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, and Arabic. A visitor can ask in any of them, by voice or text, and the answer comes back in the same language, grounded in the same source set. The curator authors once.

It’s logged and shared with the institution that runs the tour, not exposed publicly. The admin shows the question, the guide’s answer, and the stop it happened at. Visitors don’t have to create an account, and we don’t attach a question to a real identity. Your reference materials and your visitor data are not used to train models.

The same refusal behavior applies: the guide answers from your sources only, and declines anything outside them. The platform also filters obvious abuse before it reaches a response. Inappropriate questions still appear in the log so you can see what’s being asked, but they don’t produce a response that could embarrass the institution.

Not yet through a per-tour rule editor in admin. The default refusal posture is the platform’s (answer from your sources, decline when you can’t) and it’s a deliberate one. If your collection has a sensitive topic that needs special handling, the right move today is to shape the reference materials and the curator note that sits in front of the tour, and we’ll work through edge cases with you directly.

The audio tour itself plays in a browser and is resilient to flaky connections. The conversational layer needs a live network round-trip (voice in, model, voice out), so a true dead zone is the one place it can’t reach. Most institutions solve this with venue Wi-Fi at the entrance, which is enough to bootstrap. If the gallery is genuinely offline, the linear tour still plays; the Q&A pauses until signal returns.

More on how this fits into the wider tour: Authoring, Multilingual, Q&A Insights, and Pricing.

HEAR IT FOR YOURSELF

Pick one gallery.
Give us two weeks.