If you've been asked to write an RFP for an audio guide and your starting point is a 2017 document the previous head of education left in a SharePoint folder, this piece is for you. The old template was built for a different product: hardware you rent, content produced by a studio, a contract measured in years and units. The product the market actually sells now — a SaaS platform, web-delivered, multilingual by default, with conversational layers — needs a different set of questions.
I'm the founder of Convo, so I have a stake in how museums write these RFPs. I've also watched too many otherwise-rigorous procurement processes either rubber-stamp the wrong vendor because the questions didn't surface real differences, or stall for six months because they tried to apply a hardware contract to a software purchase. This piece is a working framework: the six sections an audio guide RFP needs in 2026, the exact question wording I'd use under each, the clauses we hope museums ask for (including from us), and the parts of the old template you can safely cut.
If you haven't yet picked a shortlist of vendors to send the RFP to, start with the companion piece on how to choose museum audio guide software. This piece assumes you have three to five vendors in mind and need to give them all the same document.
What is an audio guide RFP actually for?
An audio guide RFP is a structured way to make three to five vendors answer the same questions in the same shape, so you can compare them on substance rather than sales-deck polish. It is not a contract, not a legal exposure document, and not — despite what it sometimes feels like — a way to force the vendor to commit to a fixed price. It's a procurement-grade evaluation tool whose only job is to surface the real differences between platforms that, on the surface, all claim to do the same thing.
The AAM's Independent Museum Professionals guidance on RFPs makes the point well from the consultant's side: a good RFP has a reasonable timeline, clear scope, a question-and-answer period, and selection criteria that aren't just "lowest cost." From the museum's side, the equivalent test is whether your RFP, read cold by a curator at another institution, would produce comparable responses from three different vendors. If the answers will all read identically, the questions are too generic. If the questions are so specific they map to one vendor's feature list, you've written a sole-source justification, not an RFP.
The six sections below are the ones I've seen do the actual work. Everything else — letterhead, table of contents, vendor instruction boilerplate — is the wrapper.
What should section one (institutional context) cover?
The institutional context section is where you give the vendor enough about your museum to write a relevant proposal — not where you tell your origin story. Three to five short paragraphs is enough. The vendor needs to know your size, your visitor profile, your collection scope, your current state of digital interpretation, and the specific outcome you're trying to produce. Pad beyond that and you'll get back proposals that quote your own paragraphs at you.
The questions to answer in your own copy, not pose to the vendor:
- Type of institution, square footage, and annual visitor count (range is fine).
- Languages spoken by your visitor population, with rough proportions if you have them. Per the US Census Bureau's American Community Survey, roughly 22% of the US population speaks a language other than English at home, but the figure in any given city varies enormously.
- Current state of audio interpretation: nothing, legacy handsets, an existing app, a previous AI pilot.
- The trigger for this RFP — new exhibition, expiring vendor contract, accessibility mandate, multilingual gap.
- The specific outcome (not "transform visitor experience" — something like "ship a multilingual audio guide for the entire permanent collection within six months").
What this section is not: a place to disqualify vendors on size. A small-team museum and a national institution can both run the same platform; what changes is the support package, not the product.
What should section two (scope and content) cover?
Scope is where most RFPs over-specify the wrong things and under-specify the right ones. The wrong thing to over-specify is the exact content structure ("twelve stops per gallery, three minutes each, in this voice"). The right thing to specify clearly is the boundary of the contract: which galleries, how many tours, which languages on day one, and what counts as "in scope" for updates.
A working scope section answers, in your own words:
- Galleries or sites covered, with approximate object counts.
- Number of distinct tours expected (e.g., one general tour, three exhibition tours, a children's tour).
- Languages required at launch and the priority order for adding more.
- Existing source materials available to the vendor (catalogs, wall text, exhibition essays, CMS export, prior tour scripts).
- The expected publishing cadence after launch — quarterly refreshes, exhibition-driven, ad-hoc.
Then the questions to the vendor:
- "Describe your typical workflow from receipt of reference materials to a publishable first draft. Include who does what on each side."
- "What file formats and source materials does your platform accept? Include any limits on file size, object count per tour, or character count per stop."
- "How are updates handled after a tour is live? Walk us through the exact steps for changing a single line of narration in one language. Include the latency from save to live."
That last question is the most useful one in the entire section. Update latency is the single variable that most distinguishes a SaaS audio guide from a re-skinned handset content pipeline. If a vendor's answer is measured in days or requires a support ticket, that's a different product than one where a curator hits save and the change is live in seconds.
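During a pilot, update latency can be measured rather than taken on faith. The sketch below is a minimal Python illustration. It assumes a hypothetical `fetch_live_text` callable that returns the narration text a visitor would currently receive for one stop. How you obtain that text depends on the platform, and no real vendor API is implied.

```python
import time
from typing import Callable, Optional


def measure_update_latency(
    fetch_live_text: Callable[[], str],  # hypothetical: returns what visitors get now
    expected_text: str,
    timeout_s: float = 600.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Seconds from the first poll until the live text matches the saved edit.

    Call this immediately after saving the edit in the admin portal.
    Returns None if the change has not appeared within timeout_s.
    """
    start = clock()
    while True:
        if fetch_live_text() == expected_text:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

Run it a few times per language and record the worst case as well as the typical one. The worst case is the number a curator lives with on the day a label turns out to be wrong.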
What should section three (technical and platform requirements) cover?
This is the section where the "BYOD vs. handset vs. native app" decision shows up. In 2026 the realistic answer for almost every museum is web-delivered to visitors' own phones, with a small loaner fleet at the front desk for visitors without smartphones. Smartphone ownership is above 85% across most Western countries and higher still among urban, museum-going visitors. Pretending otherwise, and committing to a four-figure-per-month handset fleet to serve a gap of 15% or less, is the kind of carryover decision that quietly burns budgets.
Questions to the vendor:
- "Describe the visitor's path from arriving at a stop to hearing the narration. Include browser support, network requirements, and offline behavior if any."
- "What is your hosting and content delivery architecture? Specify cloud provider(s), regions, and CDN."
- "What is your published uptime, and what is your stated SLA for the visitor-facing player?" Most serious SaaS vendors will commit to 99.9% or better.
- "Describe your admin portal's user model. How are roles and permissions handled for staff with different responsibilities (curatorial, education, marketing, IT)?"
- "What analytics do you provide on tour use? Include any data on dwell time, completion, drop-off, and visitor questions if your platform supports a conversational layer."
- "How does your platform handle exhibition signage and QR codes? What sign-spec guidance do you provide?"
The question we hope you ask, even if you ask it of us:
- "What happens if we decide to leave the platform? Describe data export — both source materials and produced audio — and any contractual lock-in around tour content." The honest answer involves a documented export path and no claim of ownership over the museum's reference materials. The dishonest answer is a hand-wave.
What should section four (AI grounding and editorial control) cover?
This is the section the legacy RFP templates don't have, and the section where serious vendor differences live. It is also the section where the wrong vendor will charm you in a demo and the right vendor will get specific. The questions here are about how the platform handles the gap between "AI generated something" and "a visitor heard something the museum vouches for."
Recommended question wording:
- "Describe how your platform constrains script drafts to the curator's uploaded reference materials. Specify the technique (retrieval-augmented generation, fine-tuning, prompt engineering, or other) and the conditions under which the system would draft content that isn't in our sources." A platform that can articulate this clearly is one that has thought about it. One that can't is one whose model is doing whatever it does and hoping you don't notice.
- "How does the platform handle visitor-facing Q&A (if you offer it)? Specifically: what does the system do when a visitor asks a question your platform cannot ground in our source materials? Provide example outputs for both grounded and ungrounded queries." The right answer involves declining to answer or surfacing a fallback, with the visitor told what happened. The wrong answer is silence about the failure mode.
- "Describe the editorial review workflow. Specifically, can our team review and approve every piece of audio before it reaches a visitor? Is this a configurable workflow or a hard-coded approval gate?" For most curatorial teams, "nothing reaches a visitor that a curator didn't approve" is non-negotiable.
- "What is your platform's behavior when our reference materials contain a factual error a curator later corrects? Does the change propagate to visitor-facing Q&A in addition to the narrated tour?"
- "Describe any audit trail your platform maintains: which staff member edited which line, when, and (for visitor Q&A) which source document was cited in a given response."
You should expect different answers from different vendors here. If two vendors give identical answers, you haven't asked sharply enough.
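One way to picture the behavior these questions probe is a Q&A step that answers only when it can point to a source passage, and otherwise declines openly. The Python sketch below is a toy illustration, not any vendor's method. The keyword-overlap score, the `min_overlap` threshold, the stopword list, and the `SourcePassage` and `Answer` names are all assumptions made for the example. A real platform would use a proper retrieval system.

```python
from dataclasses import dataclass
from typing import List, Optional

# Small illustrative stopword list so common words don't count as evidence.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "by", "was", "what", "who", "and", "to"}
FALLBACK = "I don't have that in the museum's materials. Please ask a staff member."


@dataclass(frozen=True)
class SourcePassage:
    document: str  # hypothetical: name of the curator-uploaded file
    text: str


@dataclass(frozen=True)
class Answer:
    text: str
    grounded: bool
    cited_document: Optional[str]  # the field an audit trail would record


def _words(s: str) -> set:
    # Crude tokenizer: lowercase, strip punctuation, drop stopwords.
    tokens = {w.strip(".,;:?!\"'()").lower() for w in s.split()}
    return tokens - STOPWORDS - {""}


def answer_question(question: str, passages: List[SourcePassage], min_overlap: int = 2) -> Answer:
    q = _words(question)
    best, best_score = None, 0
    for p in passages:
        score = len(q & _words(p.text))
        if score > best_score:
            best, best_score = p, score
    if best is None or best_score < min_overlap:
        # Ungrounded: decline and say so, rather than improvising.
        return Answer(FALLBACK, grounded=False, cited_document=None)
    # Grounded: return the passage itself and record which document it came from.
    return Answer(best.text, grounded=True, cited_document=best.document)
```

The point of the sketch is the shape of the output, not the scoring. Every response carries a `grounded` flag and a cited document, and the ungrounded path is an explicit, visible fallback. Those are the two things to look for in the example outputs a vendor sends back.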
What should section five (security, privacy, and data) cover?
The security section is where SaaS RFPs converge across industries, and where the same questions you'd ask of any vendor handling your data apply — plus one or two specific to AI platforms. The market expectations in 2026 are well-established: a SOC 2 Type 2 report or equivalent, encryption at rest and in transit, documented access controls, a Data Processing Addendum (DPA), and a subprocessor list. (Convo's DPA is published; many vendors will share theirs under NDA, which is fine.)
The standard questions:
- "Do you have a current SOC 2 Type 2 report? When was it last issued and by whom? Will you provide it under NDA?"
- "Provide your Data Processing Addendum and your subprocessor list. Identify any subprocessors located outside the United States and the EU."
- "Describe encryption at rest and in transit, your access control model (MFA, least-privilege, role-based access), and your breach notification commitment." Market standard for notification is 48–72 hours.
- "What is your data retention policy after contract termination? When is our data deleted and how is deletion confirmed?"
- "What is your incident response process? Provide a summary of any security incidents in the last 24 months."
Then the two questions specifically for AI platforms — and the ones we hope every museum asks every vendor:
- "Do you use our reference materials, visitor questions, or any other data we provide to train, fine-tune, or improve any AI model — your own or a subprocessor's? If yes, describe exactly what data and for what purpose. If no, provide the contractual language we can rely on." The clean answer is a clear no, backed by contractual language to that effect. If a vendor can't commit to that in writing, your reference materials are paying part of their product development cost.
- "Which third-party model providers (OpenAI, Anthropic, Google, ElevenLabs, etc.) does your platform use? What contractual commitments do you have with those providers regarding the use of customer data for model training?" The chain is only as honest as its weakest link; a vendor whose own DPA prohibits training but whose underlying model provider's terms allow it has a problem they haven't solved.
What about accessibility — should it be its own section or live inside section three?
Accessibility belongs in the body of the RFP, not in a separate questionnaire after award. The procurement-best-practice language for digital tooling in 2026 is straightforward: conformance to WCAG 2.2 Level AA, evidence of third-party testing, and accessibility maintained through future platform updates. The DOJ's ADA Title II rule requires state and local government entities serving a population of 50,000 or more to bring their web content and mobile applications into conformance with WCAG 2.1 AA by April 24, 2026. Many publicly operated museums fall inside that scope, and designing to 2.2 provides margin.
Add to section three (technical), or use a clearly labeled section 3B:
- "Describe your conformance to WCAG 2.2 Level AA, including any known exceptions and remediation timelines."
- "Describe testing methodology — automated, manual, assistive-tech, and any third-party audits in the last 24 months."
- "Demonstrate, in product demo, keyboard navigation, screen reader operation (VoiceOver and TalkBack), and dynamic-type behavior on iOS and Android browsers."
- "Describe audio description support: text transcripts at every stop, visual descriptions of the object being interpreted, and any features specifically for blind and low-vision visitors." For more on the substance here, see our pillar on accessibility and inclusion.
- "Describe captioning and transcript availability for d/Deaf and hard-of-hearing visitors. Are transcripts produced for every language, and at what latency relative to audio publication?"
A phone-based audio guide gets a real accessibility tailwind from the visitor's own device — VoiceOver, TalkBack, dynamic type, captions, and assistive-tech compatibility come with the OS. But that's an architectural advantage, not a substitute for the platform doing its own accessibility work. The questions above test whether the vendor has done that work.
What should section six (commercial terms and SLA) cover?
The commercial section is where the SaaS-vs-hardware conceptual shift shows up most clearly, and where the old template will mislead you fastest. A handset contract from 2015 has pages of language about per-device repair, theft replacement, sanitization, charging-station maintenance, and consumable parts. None of that applies to a SaaS platform. What does apply:
- Subscription term and renewal. Annual or multi-year, auto-renewal terms, price-protection language across renewal cycles.
- Pricing model. Per-tour, per-language, per-stop, per-visitor, or flat platform fee. The serious modern vendors are moving toward flat platform fees — Convo's published pricing is $0/Pilot, $1,200/month Studio, $3,500/month Institution, with all paid tiers including unlimited tours, unlimited languages, and unlimited edits. Vendors who charge per-stop or per-language are betting on you growing into a bigger bill; vendors with flat pricing are betting on retention.
- Uptime SLA. 99.9% is the market floor for visitor-facing SaaS; some vendors offer 99.95% on enterprise tiers. Specify the measurement window (monthly or quarterly), the credit mechanism if the SLA is breached, and what counts as scheduled vs. unscheduled downtime.
- Support SLA. Response-time commitments by severity. Typical SaaS shapes: 4-hour response on critical, 1 business day on high, 2 business days on medium. Ask whether support is included in the subscription or sold separately.
- Update latency commitment. Distinct from uptime: how long from "curator saves a corrected line" to "next visitor hears the corrected line." This belongs in the SLA, not the feature list.
- Data portability and exit. Covered in section three: the contract should commit the vendor to a documented export of source materials and produced audio on termination, with a defined timeline.
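The uptime percentages above translate into concrete downtime allowances, and the measurement window changes how long a single outage can run before the SLA is breached. A short Python calculation shows this, assuming a 30-day month and a 90-day quarter for illustration:

```python
def allowed_downtime_minutes(uptime_pct: float, window_days: int) -> float:
    """Minutes of downtime permitted by an uptime percentage over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - uptime_pct / 100)


for pct in (99.9, 99.95):
    monthly = allowed_downtime_minutes(pct, 30)
    quarterly = allowed_downtime_minutes(pct, 90)
    print(f"{pct}%: {monthly:.1f} min per 30 days, {quarterly:.1f} min per 90 days")

# 99.9%: 43.2 min per 30 days, 129.6 min per 90 days
# 99.95%: 21.6 min per 30 days, 64.8 min per 90 days
```

A single two-hour outage on a busy Saturday would breach a 99.9% SLA measured monthly but fit inside one measured quarterly. That is why the window belongs in the contract language.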
Question wording for the vendor:
- "Provide your standard SLA, including uptime, support response times, and any update-latency commitments. Specify scheduled-maintenance windows."
- "Describe your pricing model in full. Include any per-stop, per-language, per-visitor, or per-tour fees not in the base subscription. State your price-protection commitment across renewal cycles."
- "What is your standard contract term and termination notice? Describe any early-termination fees and the data-return process on termination."
If the proposal you get back has more language about hardware than about software, you've been quoted a different product than the one the market is now selling.
What to cut from the old template
The most useful editing pass on an inherited audio guide RFP is the deletion pass. Specifically:
- Hardware-fleet language. Anything about device counts, charging stations, sanitization protocols, theft replacement, or repair turnaround. If your platform requires handsets, you're back in a 2015 contract.
- App-store distribution language. If the platform requires visitors to download a native iOS or Android app, you're committing to the low-single-digit adoption rates the published industry research keeps reporting for museum apps (Frankly Green + Webb). Web delivery is the visitor-experience default in 2026.
- Production-cycle language assuming studio recording. "Voice talent selection," "studio booking," "mastering and QC" do not map onto an AI-narrated platform. The equivalent questions live in sections two and four — drafting workflow, editorial control, multilingual generation.
- Per-tour pricing assumptions. If your evaluation rubric assumes you'll buy three tours and pay per-tour, you'll mis-rank flat-fee vendors against per-tour vendors. Normalize on five-year total cost of ownership.
- Anything about CDs, MP3 players, or rental kiosks. Yes, this still shows up in templates. Yes, you should cut it.
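To make the five-year normalization concrete, here is a small Python sketch. The flat-fee figure uses the $1,200/month Studio price quoted in the pricing bullet. The per-tour vendor's price and the tour counts are invented purely for illustration and do not describe any real vendor.

```python
from typing import List


def flat_fee_tco(monthly_fee: float, years: int = 5) -> float:
    """Total cost of a flat monthly subscription over the given number of years."""
    return monthly_fee * 12 * years


def per_tour_tco(annual_fee_per_tour: float, tours_by_year: List[int]) -> float:
    """Total cost when each live tour is billed annually; one list entry per year."""
    return sum(annual_fee_per_tour * n for n in tours_by_year)


flat = flat_fee_tco(1200)  # flat platform fee, unlimited tours
# Hypothetical per-tour vendor: $4,000 per tour per year, growing from 3 to 7 tours.
per_tour = per_tour_tco(4000, [3, 4, 5, 6, 7])

print(f"Flat fee, 5 years: ${flat:,.0f}")      # Flat fee, 5 years: $72,000
print(f"Per tour, 5 years: ${per_tour:,.0f}")  # Per tour, 5 years: $100,000
```

On these made-up numbers the per-tour vendor is cheaper in year one ($12,000 against $14,400) and more expensive over five years ($100,000 against $72,000). A rubric that scores only the first-year quote would rank the two the wrong way round.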
What's the right next step?
If you're inside two weeks of issuing this RFP, the most useful next move is to walk it through one trusted vendor as a friendly read-through — not to negotiate, but to find the questions that don't make sense in their world. The questions a good vendor flags as ambiguous are usually the questions that would have produced incomparable answers from the others.
If you haven't yet drawn up your shortlist, our companion piece on how to choose museum audio guide software walks through the criteria for that step. For the broader pillar on procurement and total cost of ownership, see buying and cost. And if you'd like to see the answers Convo would give to this RFP — including the security and AI-grounding clauses — our DPA and pricing are both published, and you can run a free pilot before any procurement decision.
The verdict
A good museum audio guide RFP in 2026 is shorter, sharper, and more specific to SaaS than the templates most institutions are starting from. Six sections do almost all the work: institutional context, scope and content, technical and platform, AI grounding and editorial control, security and data, and commercial terms. The questions that surface real vendor differences are about how something is done — update latency, grounding behavior on ungrounded questions, training-data prohibitions, editorial workflow — not whether it's done. Everything else is sales-deck noise.
Two commitments are worth asking for even from us: an explicit ban on using your reference materials to train models, and documented grounding behavior with example outputs for both grounded and ungrounded visitor queries. If a vendor, Convo included, can't provide those crisply, you've learned something useful about the platform.
For the broader context, see the buying and cost pillar, the companion piece on choosing museum audio guide software, and our accessibility pillar for the accessibility section in more depth.
About the author
Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the economics and procurement of museum interpretation from inside the category — drawing on RFP responses, discovery calls with curators and directors, and the contract shape that fits a SaaS platform rather than a handset fleet. Reach him at eric@convo.app or on LinkedIn.