Why grounding matters more than the model.

Every vendor pitch in 2026 leads with the model — GPT-5, Claude, Gemini, the next one. None of those are the moat. The moat is whether the platform decides to ground the answer or fall back to plausibility, and museums should be evaluating vendors on that, not on the model nameplate.

ERIC DUFFY·FOUNDER·JUN 3, 2026·9 MIN READ

Every vendor pitch you'll hear in 2026 mentions the model. Some lead with it. "We're built on GPT-5." "We use Claude." "We just upgraded to Gemini 3." The implication is always the same — that the choice of foundation model is the meaningful differentiator, and you should care which one a vendor picked.

You should not. The model is the part of the platform you should care about least. Not because models don't matter — they do — but because they're the part of the stack that gets swapped most easily, costs the least to integrate, and is least under the vendor's control. The real engineering work in an AI audio guide platform is the part nobody on the demo call talks about, because there's no logo for it. It's the grounding. It's the part of the product that decides whether a visitor gets an answer drawn from the curator's source materials or a plausible-sounding paragraph drawn from somewhere in the model's training data.

That decision is the moat. Everything else is replaceable.

What grounding actually is

I want to use the technical term because the technical term is real, and the marketing softens it into nothing.

"Grounding" is the platform's commitment to answer visitor questions from a defined set of source documents, and to decline gracefully when the question doesn't have an answer inside that set. In modern AI systems this is usually implemented through retrieval-augmented generation (RAG): the platform takes the visitor's question, searches the curator's uploaded materials for the most relevant passages, hands those passages to the model along with the question, and instructs the model to answer from those passages or to say it doesn't have the information.

That's the technical shape. The interesting part is what's not in the shape: any path for the model to make up a plausible answer from the open internet, from its pretraining, or from inference about "what a museum probably has in its collection." A well-grounded platform constrains the model so tightly that the only outputs it produces are either supported by source passages it can name, or a refusal.
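For readers who want to see that shape in code, here is a minimal sketch. It is an illustration under stated assumptions, not any vendor's implementation: the word-overlap retrieval stands in for a real search index, `model` is any function that takes a prompt and returns text, and the names `Passage`, `retrieve`, `build_prompt`, and `answer` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

REFUSAL = "I don't have that information."
STOPWORDS = {"a", "an", "the", "is", "was", "what", "who", "by", "in", "of", "it"}


@dataclass
class Passage:
    source: str  # hypothetical identifier for a curator-uploaded document
    text: str


def tokens(s: str) -> set:
    """Lowercased words with surrounding punctuation and stopwords removed."""
    words = {w.strip(".,?!;:\"'()").lower() for w in s.split()}
    return words - STOPWORDS - {""}


def retrieve(question: str, passages: List[Passage],
             min_overlap: int = 2, top_k: int = 3) -> List[Passage]:
    """Toy retrieval: rank passages by shared words and drop weak matches."""
    q = tokens(question)
    scored = [(len(q & tokens(p.text)), p) for p in passages]
    strong = [sp for sp in scored if sp[0] >= min_overlap]
    strong.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in strong[:top_k]]


def build_prompt(question: str, hits: List[Passage]) -> str:
    """Hand the model the passages, the question, and the refusal instruction."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in hits)
    return (
        "Answer only from the passages below and cite the source id. "
        f"If they do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, passages: List[Passage],
           model: Callable[[str], str]) -> str:
    hits = retrieve(question, passages)
    if not hits:
        # Nothing to ground on: refuse without ever calling the model.
        return REFUSAL
    return model(build_prompt(question, hits))
```

The design choice worth noticing is the early return: when retrieval finds nothing, the model is never asked, so there is no path for it to improvise.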

A poorly grounded platform, or a platform that hasn't really decided whether it cares about this, runs the same model with weaker guardrails. The visitor asks a question, the model produces a paragraph, and most of the time the paragraph is right because the model was trained on the public internet and the public internet includes a lot of museum content. Most of the time. Until it isn't, and then the paragraph is a confident-sounding hallucination about a work the museum doesn't own.

The user experience of both platforms is identical until the moment they fail, and then one fails by saying "I don't have that information" and the other fails by inventing.

Why grounding is harder than the model

The model is, increasingly, a commodity. The three or four labs that train frontier models all ship within months of each other, all hover within a few points of each other on the benchmarks anyone cares about, and all expose the same kind of API. Switching from one to another, for a vendor, is mostly a configuration change. We've done it. Other serious vendors in the category have done it. The cost is a few sprints of integration and prompt tuning, not a rewrite.

Grounding is the opposite. Grounding is plumbing, and the plumbing depends on judgment that no vendor can buy out of the box.

The plumbing has, at minimum, four pieces. There's the ingestion layer that turns a curator's uploaded materials — catalog notes, wall text, exhibition essays, a CSV from the collections management system — into structured passages the retrieval system can search. There's the retrieval system itself, which has to find the right passages at runtime, with the right ranking, in the right language, in fractions of a second. There's the prompt layer that tells the model what to do with what was retrieved, including the negative instructions — don't answer if you can't ground, don't paraphrase aggressively, don't fill gaps with reasonable-sounding general knowledge. And there's the refusal logic that decides what to say to a visitor when the retrieval came back empty.
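To make the first of those pieces concrete, here is a rough sketch of an ingestion step. Everything specific in it is an assumption for illustration: the CSV column names (`object_id`, `stop_id`, `description`), the 400-character chunk size, and the `source` label format are hypothetical, and a production pipeline would make each of these choices more carefully.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source: str   # where the text came from, kept so answers can cite it
    stop_id: str  # which tour stop the passage belongs to
    text: str


def passages_from_csv(csv_text: str) -> List[Passage]:
    """One passage per collections row that actually has a description."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = (row.get("description") or "").strip()
        if not text:
            continue  # an empty record gives retrieval nothing to ground on
        out.append(Passage(f"collections-csv:{row['object_id']}",
                           row["stop_id"], text))
    return out


def passages_from_essay(name: str, stop_id: str, essay: str,
                        max_chars: int = 400) -> List[Passage]:
    """Pack whole paragraphs into chunks of at most roughly max_chars."""
    paragraphs = [p.strip() for p in essay.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return [Passage(f"{name}#chunk{i}", stop_id, c)
            for i, c in enumerate(chunks)]
```

Even this toy version contains judgment calls: whether to split on paragraphs or sentences, how large a chunk should be, and what to do with empty records. Those are the kinds of decisions that make retrieval robust or brittle later.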

Each of those four pieces has dozens of small judgment calls, and each judgment call cascades. A vendor that doesn't think hard about ingestion will index materials in a way that makes retrieval brittle. A vendor that doesn't think hard about retrieval will get the wrong passages and let the model paper over the mismatch with plausible language. A vendor that doesn't think hard about the prompt layer will see the model "helpfully" answer from training data when retrieval is thin. A vendor that doesn't think hard about refusal will produce a visitor experience that feels confidently wrong — which is the worst possible failure mode for a museum.

Switching the underlying model from GPT to Claude doesn't fix any of those problems. Only work on the grounding architecture does. That's why the model is the part vendors talk about and grounding is the part they don't: one is easy to put in a pitch, the other is invisible in marketing.

Hallucination, more honestly than the category usually is

I want to be careful here, because the honest version of this argument is harder to make than the marketing version.

Grounding doesn't eliminate hallucination. The 2025 literature on retrieval-augmented systems is clear about this: grounding significantly reduces hallucination, and on most benchmarks the reduction is large, but what remains isn't zero. A model with a passage in front of it can still misread the passage, contradict it, or extrapolate beyond what's there. Anyone telling you the grounding architecture solves the problem is overselling.

What grounding does do, when done well, is change the failure mode. An ungrounded system, asked a question outside its competence, fails by producing a confident wrong answer that is indistinguishable from a right one. A grounded system, asked the same question, fails either by retrieving the wrong passage and getting visibly tripped up on it, or by retrieving nothing and refusing to answer.

The ungrounded failure mode is the one a museum cannot live with. A confident hallucination about an artist's biography or a provenance chain is the kind of thing that ends up in a review, in a board email, in a curator's inbox the next morning. The grounded failure mode is one a museum can live with. An "I don't have that information about this stop" response is a workable answer in the moment: the visitor moves on, or asks the docent, or reads the wall card. It's not satisfying but it isn't damaging.

The procurement question, then, is not "does this platform hallucinate?" Every platform hallucinates sometimes. The procurement question is "which failure mode does this platform produce when it fails?" A vendor that can answer this question crisply, with examples, with documentation, with a demo that shows a refusal as cleanly as it shows a great answer, has thought about grounding. A vendor that gets uncomfortable and pivots to talking about the model has not.

Decline gracefully

The single most underrated feature of a serious AI audio guide is the refusal.

This is the part of the product that the marketing literally cannot describe well, because the demo of a great refusal is the absence of a great answer, and absence doesn't demo. But the refusal is what makes the rest of the platform trustworthy. A guide that will refuse a question it can't ground is one a curator can hand to a visitor. A guide that will produce something rather than say nothing is one a curator cannot.

In Convo's product, the refusal looks like a sentence in the conversation that says, plainly, "I don't have anything on that in the materials for this stop — you might check the museum's website, or ask a docent if one is nearby." It is not satisfying language. It is not, in some sense, the kind of language a marketing team would write. It is what a competent museum educator would say if asked a question outside their expertise. That's the model we want. The platform should behave like a careful colleague, not like a confident stranger.

I bring this up because if you take only one thing away from this essay, it should be this: when you evaluate a vendor, ask to see the refusal. Make them demonstrate the failure case as carefully as they demonstrate the success case. The refusal is the trust signal. If the refusal is messy, or hedged, or sounds like a hallucination apologizing for itself, you've found the vendor that hasn't done the work.

What to ask vendors

The procurement-grade version of this argument is a set of questions worth raising on every vendor call. They are not the questions the vendor wants. They are the ones that tell you whether the platform has been built by people who took grounding seriously.

Ask where the grounding lives. Is the model constrained to your uploaded materials, or does it have any path to answer from training data or the open web? "It has the option to fall back to general knowledge when retrieval is thin" is the wrong answer.

Ask what the platform does when retrieval comes back empty. The right answer is a graceful, plain refusal that tells the visitor the information isn't available. The wrong answer is anything that involves the model generating a "best guess" or filling the gap.

Ask how the vendor's pipeline distinguishes between a question that's adjacent to your materials and a question that's outside them. This is a subtle question and only vendors who have thought about it have an answer.
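One way a pipeline might draw that line, sketched under assumptions: suppose retrieval returns a best-match relevance score between 0 and 1, and the platform uses two thresholds rather than one. The threshold values, the `Scope` names, and the wording of the responses below are all hypothetical, chosen only to show the idea.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    IN_SCOPE = "in_scope"  # retrieval is strong enough to answer from
    ADJACENT = "adjacent"  # related material exists, but not an answer
    OUTSIDE = "outside"    # nothing relevant in the uploaded materials


def classify(best_score: float, answer_at: float = 0.75,
             adjacent_at: float = 0.45) -> Scope:
    """Map the best retrieval score onto one of three bands."""
    if not 0.0 <= adjacent_at <= answer_at <= 1.0:
        raise ValueError("need 0 <= adjacent_at <= answer_at <= 1")
    if best_score >= answer_at:
        return Scope.IN_SCOPE
    if best_score >= adjacent_at:
        return Scope.ADJACENT
    return Scope.OUTSIDE


def refusal_for(scope: Scope, covered_topic: str = "") -> Optional[str]:
    """Return refusal text, or None when the question should be answered."""
    if scope is Scope.IN_SCOPE:
        return None
    if scope is Scope.ADJACENT and covered_topic:
        # Adjacent: still a refusal, but it points at what the materials cover.
        return ("I don't have that in the materials for this stop, "
                f"but I can tell you about {covered_topic}.")
    return "I don't have anything on that in the materials for this stop."
```

The point of the middle band is that an adjacent question still gets a refusal, not a stretched answer. Where the thresholds sit is exactly the kind of judgment call a vendor should be able to explain.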

Ask whether the visitor's question and the model's response are logged for curator review, with the source passages used to ground the answer. If the answer is a vague "we have analytics," you don't have an audit trail. You have a starts-and-completions dashboard.
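As a rough picture of what an audit trail could contain, here is a sketch of a single log record. The field names and the `needs_review` rule are assumptions made for illustration, not a description of any vendor's logging.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AuditRecord:
    timestamp: str
    stop_id: str
    question: str
    response: str
    refused: bool
    source_ids: List[str]  # passages the answer was grounded on
    needs_review: bool     # True when an answer was given with no sources


def make_record(stop_id: str, question: str, response: str, refused: bool,
                source_ids: List[str],
                now: Optional[datetime] = None) -> AuditRecord:
    now = now or datetime.now(timezone.utc)
    return AuditRecord(
        timestamp=now.isoformat(),
        stop_id=stop_id,
        question=question,
        response=response,
        refused=refused,
        source_ids=list(source_ids),
        # An answer with nothing behind it is what a curator should see first.
        needs_review=(not refused) and not source_ids,
    )


def to_jsonl(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

A record like this lets a curator open any exchange and see which passages stood behind the answer, which a starts-and-completions dashboard cannot do.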

Ask, finally, what the failure rate looks like. Vendors who have measured this can tell you. Vendors who haven't measured it will reach for the model again. Listen for the reach.
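Measuring this does not require anything exotic. As a minimal sketch, assume a curator has reviewed a sample of logged exchanges and labeled each one with one of three hypothetical outcomes: `grounded_correct`, `refused`, or `confidently_wrong`. The rates then fall out of a simple count.

```python
from collections import Counter
from typing import Dict, Iterable

OUTCOMES = ("grounded_correct", "refused", "confidently_wrong")


def failure_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Share of reviewed exchanges that fell into each outcome."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labeled exchanges to measure")
    return {name: counts[name] / total for name in OUTCOMES}
```

Keeping `refused` and `confidently_wrong` as separate numbers matters, because the two failures carry very different costs for a museum.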

The future this points toward

Two predictions that I think will look obvious in five years.

The first is that model commoditization is going to accelerate. The next two or three model generations will narrow the gap between frontier labs further, and "we use GPT" is going to be roughly as differentiating as "we use AWS" is today. A vendor that built their pitch around the model is going to have to rebuild it, fast, around something else. The vendors who built around grounding will look prescient.

The second is that the museum field is going to learn this. Procurement officers and curators are going to stop asking "which model do you use?" and start asking "what does it do when it doesn't know?" That shift, in this category, is the one that will sort the serious platforms from the ones that have a marketing site and a wrapper around someone else's API.

We are betting the company on the second question. I think we have it right.


About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the engineering decisions behind AI audio guides for museum directors and curators evaluating the category. Reach him at eric@convo.app or on LinkedIn.
