PILLAR 02 · BUYING & COST

Total cost of ownership: hardware vs phone-based museum audio guides.

A five-year TCO model comparing the handset-and-studio model to a phone-based, AI-narrated platform — capex, opex, content production, staff hours, and the line items every RFP forgets.

ERIC DUFFY · FOUNDER · 12 MIN READ · UPDATED 2026-05-29

A total-cost-of-ownership comparison only matters if the institution is actually planning for the next five years. Most audio-guide RFPs price the next twelve months — the cost of the handset fleet, the cost of producing one tour, the cost of the first three languages — and treat the rest as someone else's problem. That's how museums end up with a five-year-old fleet that's been re-batteried twice, a tour script with a factual error nobody can justify the budget to fix, and a Spanish track recorded by a voice talent who's no longer reachable.

This piece is a five-year TCO model for the two real options a museum is choosing between in 2026: a hardware-handset program (own or rent the fleet, produce content in a studio, hand devices to visitors at the desk) and a phone-based program (no hardware, visitors scan a QR code, content runs in the browser). I've drawn the line items from the pillar guide on buying and cost, from RFPs and discovery calls with curators, and from the public modeling other vendors in the category have published. Where a number is sourced, I've linked it; where it's an estimate based on what we hear from prospects, I've said so.

A note before the model: total cost of ownership is the right frame here, but it isn't the only frame. A higher-cost program that puts a name a board recognizes on a wall is sometimes worth the line. The model below is for the much more common case where the museum wants the maximum interpretive program for a fixed budget over a multi-year horizon.

What goes into the TCO of a museum audio guide?

A defensible five-year TCO model has to include eight categories; most RFPs include four. The four that almost always make it in are hardware capex (or annual rental), content production for the first tour, voice talent for the first few languages, and the platform or studio fee. The four that get left out account for most of the variance between vendors: ongoing staff time for device distribution, charging, and sanitation; breakage and theft replacement, which any high-use shared-device program absorbs at a low-double-digit percentage of the fleet per year; content refresh and update cycles, including the often-impossible task of re-booking the original voice talent; and per-language scaling costs, which grow linearly with each added language in the hardware model and approach zero in the phone-based model.

The eight categories below are how we model TCO when a director asks us to put a number against a competing bid. They map cleanly to either model, which is what makes the comparison honest.

| Cost category | Hardware-handset model | Phone-based BYOD model |
|---|---|---|
| 1. Device capex or rental | Low-hundreds-of-dollars per handset × a 50–100-unit fleet → tens of thousands of dollars upfront, replaced every 2–3 years | $0 |
| 2. Charging, racks, storage | One-time fit-out plus amortized maintenance; small but real | $0 |
| 3. Staff distribution + sanitation | A few minutes of staff time per use, compounding to the equivalent of roughly a quarter of an FTE on a busy fleet | Near zero (QR signage maintenance only) |
| 4. Breakage / loss replacement | Low-double-digit-percent of fleet per year is the realistic planning range | $0 |
| 5. Content production (first tour) | $30,000–$150,000 per traditional tour before hardware (Convo) | Included in subscription or near-zero marginal |
| 6. Per-language scaling | Several thousand dollars and several weeks per additional language at minimum | Near zero (ten languages from one source on most modern platforms) |
| 7. Content refresh + updates | Re-book talent, re-engineer mix, re-deploy; a meaningful five-figure budget per refresh | Edit in admin, re-voice in seconds |
| 8. Platform or studio fee | One-off production budget per project | Low-hundreds to low-thousands of dollars per month, varies by vendor |

The next section runs both models through these categories over five years for a representative mid-size museum.

What does a five-year TCO actually look like?

For a 200,000-visitor museum producing one 30-stop tour in three languages, the hardware model lands near $223,000 over five years; the phone-based model lands near $72,000. The detail matters; the headline is that the gap is roughly three to one, and it widens every time the museum adds a language, refreshes content, or grows the program.

The model below is illustrative, not a benchmark. It assumes a mid-size institution: 200,000 annual visitors, 20% audio-guide uptake (a planning-grade assumption for handset programs at most venues), a 100-handset fleet, one 30-stop tour at launch, one significant content refresh in year three, and one fourth language added in year four. Staff cost is set at $20/hour fully loaded. Where the assumptions in your institution differ materially, the shape of the comparison doesn't change, but the magnitude does. The intent is to show how the line items relate, not to claim a single "right" number for any given museum.

Hardware-handset model — five-year TCO

| Year | Capex | Opex | Content / refresh | Year total |
|---|---|---|---|---|
| Year 1 | $25,000 (100 handsets) + $5,000 fit-out | $14,000 (staff distribution + sanitation + batteries) | $60,000 (one tour, three languages) | $104,000 |
| Year 2 | none | $14,000 + $3,500 breakage replacement | none | $17,500 |
| Year 3 | none | $14,000 + $3,500 breakage | $25,000 (content refresh, one wing) | $42,500 |
| Year 4 | $12,000 (partial fleet replacement) | $14,000 + $3,500 breakage | $12,000 (fourth language) | $41,500 |
| Year 5 | none | $14,000 + $3,500 breakage | none | $17,500 |
| 5-year total | | | | ~$223,000 |

A reader will notice that this total is higher than the headline production range alone. A more conservative refresh cycle, a smaller fleet, or a tighter language program brings it down; a museum that actually keeps the program current pushes it higher. Pick whichever assumption set matches your institution. The qualitative point holds either way.
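For anyone who wants to swap in their own figures, the hardware table can be reproduced in a few lines. This is a minimal sketch, not a procurement tool: the line items are copied from the table, while the dictionary layout and the function names are illustrative choices of ours.

```python
# Line items copied from the hardware-handset table (USD).
# The structure and names here are illustrative, not a vendor schema.
HARDWARE_YEARS = {
    1: {"capex": 25_000 + 5_000, "opex": 14_000, "content": 60_000},
    2: {"capex": 0, "opex": 14_000 + 3_500, "content": 0},
    3: {"capex": 0, "opex": 14_000 + 3_500, "content": 25_000},
    4: {"capex": 12_000, "opex": 14_000 + 3_500, "content": 12_000},
    5: {"capex": 0, "opex": 14_000 + 3_500, "content": 0},
}


def year_total(items: dict[str, int]) -> int:
    """Sum every line item for a single year."""
    return sum(items.values())


def five_year_total(years: dict[int, dict[str, int]]) -> int:
    """Sum the yearly totals across the whole planning horizon."""
    return sum(year_total(items) for items in years.values())


hardware_total = five_year_total(HARDWARE_YEARS)

for year, items in HARDWARE_YEARS.items():
    print(f"Year {year}: ${year_total(items):,}")
print(f"5-year total: ${hardware_total:,}")  # 5-year total: $223,000
```

Changing a single entry, such as dropping the year-three refresh or halving the fleet capex, shows how sensitive the total is to each assumption.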

Phone-based BYOD model — five-year TCO

| Year | Capex | Subscription | Content / refresh | Year total |
|---|---|---|---|---|
| Year 1 | $0 | $14,400 (Studio tier, $1,200/mo) | Included | $14,400 |
| Year 2 | $0 | $14,400 | Included (edits ship in admin) | $14,400 |
| Year 3 | $0 | $14,400 | Included (refresh is a Tuesday, not a project) | $14,400 |
| Year 4 | $0 | $14,400 | Included (fourth language: minutes, not weeks) | $14,400 |
| Year 5 | $0 | $14,400 | Included | $14,400 |
| 5-year total | | | | ~$72,000 |

These are Convo's Studio-tier numbers; other modern AI-narrated platforms publish flat-subscription shapes in the same order of magnitude. Move to an Institution-tier plan ($3,500/mo) and the five-year total approaches the lower end of the hardware-model range — but the institution gets unlimited tours, unlimited languages, and same-day updates for that money rather than one frozen tour.
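The tier comparison is simple enough to check by hand, but a short sketch makes it easy to rerun with another vendor's price. It assumes a flat monthly price with no annual increases or setup fees, which is our simplification; the two monthly prices and the hardware total are the ones quoted in this piece.

```python
MONTHS_PER_YEAR = 12
HORIZON_YEARS = 5

# Monthly prices quoted in the text (USD).
TIERS = {"Studio": 1_200, "Institution": 3_500}

# Five-year total from the hardware-handset table.
HARDWARE_FIVE_YEAR = 223_000


def subscription_total(monthly: int, years: int = HORIZON_YEARS) -> int:
    """Flat subscription cost over the horizon (assumes no price changes)."""
    return monthly * MONTHS_PER_YEAR * years


totals = {name: subscription_total(price) for name, price in TIERS.items()}

for name, total in totals.items():
    ratio = HARDWARE_FIVE_YEAR / total
    print(f"{name}: ${total:,} over {HORIZON_YEARS} years "
          f"(hardware is {ratio:.2f}x this)")
```

On these inputs the Studio tier comes to $72,000 and the Institution tier to $210,000, which is why the higher tier sits close to the hardware total rather than far below it.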

The line that moves the needle isn't the subscription. It's the absence of the $60,000 first-year content production and the $25,000 refresh cycle. Those line items don't exist in the phone-based model because the content production is the platform, and the refresh is a script edit.

Where the hardware model still costs less

Three honest scenarios where a hardware-handset program is the cheaper five-year option. Pretending otherwise would just be marketing.

One: a single permanent tour, no languages added, no refreshes. If the institution will produce one English tour, never add a second language, never update the script, and accept a five-year-old recording as a permanent artifact, the amortized cost of a hardware program can dip below a phone-based subscription. We see this with very small historic sites and single-room museums. It's a real case.

Two: a high-prestige named-voice program. If a major donor, a curator, or a celebrity has agreed to narrate a wing, the production is the product, and the comparison isn't to a generated voice. The five-year cost is high, but the alternative isn't a subscription — it's not doing the program. (We covered this case at length in AI audio guide vs traditional.)

Three: existing capex already amortized. If a museum bought a fleet in 2022 and is mid-cycle, the cheapest next year is the year of using what's on the shelf. The migration plan is for the renewal window, not for next month. We routinely tell prospects this on first calls.

Outside those three, the phone-based model is structurally cheaper, and the gap grows the longer the planning horizon.

What hidden costs do RFPs most often miss?

Four line items that almost never make it into a first-pass procurement spreadsheet, and that account for most of the post-launch budget surprises.

Staff distribution and turnaround. A 100-handset fleet at 16,000 uses per year, sanitized for two minutes between uses, absorbs roughly a quarter of an FTE on cleaning alone — before you've counted the desk attendant at distribution and the staff hours for charging, troubleshooting, and lost-device incident reports. Most RFPs price the desk attendant at intake; almost none price the rest.
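The quarter-FTE figure follows directly from the numbers in that paragraph. The sketch below works it through; the 2,080-hour working year is an assumption of ours (40 hours across 52 weeks), and the $20 hourly rate is the fully loaded rate used in the model above.

```python
USES_PER_YEAR = 16_000       # handset uses per year, from the text
MINUTES_PER_CLEAN = 2        # sanitation time between uses, from the text
HOURLY_RATE = 20             # fully loaded staff cost used in the model (USD)
FTE_HOURS_PER_YEAR = 2_080   # assumption: 40 hours x 52 weeks

# Total staff hours spent on cleaning alone.
cleaning_hours = USES_PER_YEAR * MINUTES_PER_CLEAN / 60

# The same load expressed as a fraction of one full-time employee.
cleaning_fte = cleaning_hours / FTE_HOURS_PER_YEAR

# Annual cost of that time at the fully loaded rate.
cleaning_cost = cleaning_hours * HOURLY_RATE

print(f"Cleaning hours per year: {cleaning_hours:,.0f}")
print(f"Share of one FTE: {cleaning_fte:.2f}")
print(f"Annual cost: ${cleaning_cost:,.0f}")
```

That is roughly 533 hours a year for cleaning alone, before any distribution, charging, or troubleshooting time is counted.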

Per-language scaling on the legacy model. Adding a fourth language to a studio-produced tour is rarely $5,000 by the time the curator has been through the loop on cultural adaptation, the talent has been re-cast, the studio has been re-booked, and the audio has been re-engineered to match the original mix. We've seen real numbers in the $12,000–$25,000 range per additional language for a mid-size institution; that line gets written off the spreadsheet because "we'll do it next year" — and then doesn't get done.
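To see how that line grows, here is a small sketch that multiplies the per-language range quoted above by the number of languages added. It assumes each language costs the same as the last, with no volume discount and no shared re-engineering work, which is a simplification on our part.

```python
# Range per additional studio-produced language, as quoted in the text (USD).
PER_LANGUAGE_LOW = 12_000
PER_LANGUAGE_HIGH = 25_000


def added_language_cost(languages_added: int) -> tuple[int, int]:
    """Return the (low, high) studio cost of adding this many languages.

    Assumes a constant cost per language, so the total grows linearly.
    """
    return (languages_added * PER_LANGUAGE_LOW,
            languages_added * PER_LANGUAGE_HIGH)


for n in range(1, 5):
    low, high = added_language_cost(n)
    print(f"{n} added language(s): ${low:,} to ${high:,}")
```

In the phone-based table the same additions show up as "Included", so the equivalent line stays flat as languages are added.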

Content correction at the line level. A factual error on stop seventeen of a traditional tour means re-booking the same voice talent (often impossible after a year), recording one line, and re-mastering the cut. The realistic cost is a $2,000–$5,000 incident, which means most of these errors don't get fixed. The hidden cost is reputational, not financial; it's the curator's relationship with their own program. A phone-based model makes the same fix a 30-second admin action — which means the fixes actually happen.

Loss, theft, and pandemic-era hygiene. Five-year planning has to assume at least one external event that changes how visitors feel about holding a device hundreds of other people held that day. The hardware model carries this risk on the institution's balance sheet. The BYOD model offloads it entirely — and 2020 demonstrated, in the most expensive way possible, what that delta is worth.

What does the phone-based model still cost?

A phone-based subscription is not free, and the absence of a hardware line doesn't mean the absence of a budget line. The honest costs of a BYOD program over five years are:

The platform subscription itself. $1,200/month is a real line on a real budget. For an institution that's used to capex-led procurement, an opex-led subscription model can feel structurally different even when it's mathematically cheaper. The board explanation is the same one that worked for SaaS in every other sector: predictable monthly cost, no balance-sheet asset to depreciate, no replacement cycle.

Internal staff hours. Uploading reference materials, reviewing drafts, approving translations, and maintaining QR signage are real work. A serious phone-based program typically runs 0.1–0.25 FTE in the first year and drops to 0.05–0.1 FTE in steady state. That's lower than the hardware model's distribution and sanitation load, but it isn't zero.

Loaner phones for the smartphone gap. The 10–15% of visitors who don't bring a smartphone, or whose battery dies, are handled in most successful BYOD programs by a small fleet of 10–20 loaner phones at the front desk. The five-year cost of this is real but small, a fraction of a full hardware fleet, and it's an accommodation rather than the default. We covered this comparison in detail in BYOD vs rented handsets.

Signage and wayfinding. QR codes need to live somewhere visitors actually see them, and that signage has to survive the same wear as the rest of the gallery. Budget for refresh on a 2–3 year cycle; it's the only physical-asset line a phone-based program carries.

Add these up across five years and the phone-based model's true cost lands roughly 30% above the headline subscription. The hardware model's true cost lands roughly 80–100% above its headline capex. That delta is the TCO case in one sentence.
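As a rough check on the phone-based side, the staff-hours range above can be priced out and compared with the subscription. This is a sketch under stated assumptions: the FTE fractions, the $20 hourly rate, and the $72,000 subscription come from this piece, the 2,080-hour working year is our assumption, and loaner phones and signage are left out because no dollar figures are given for them here.

```python
HOURLY_RATE = 20             # fully loaded rate from the model's assumptions (USD)
FTE_HOURS_PER_YEAR = 2_080   # assumption: 40 hours x 52 weeks
SUBSCRIPTION_5YR = 72_000    # Studio tier over five years, from the table

# (low, high) FTE fractions from the text:
# one launch year, then four steady-state years.
FTE_BY_YEAR = [(0.10, 0.25)] + [(0.05, 0.10)] * 4


def staff_cost(fte: float) -> float:
    """Annual cost of a given FTE fraction at the fully loaded rate."""
    return fte * FTE_HOURS_PER_YEAR * HOURLY_RATE


staff_low = sum(staff_cost(low) for low, _ in FTE_BY_YEAR)
staff_high = sum(staff_cost(high) for _, high in FTE_BY_YEAR)

print(f"5-year staff time: ${staff_low:,.0f} to ${staff_high:,.0f}")
print(f"Share of subscription: {staff_low / SUBSCRIPTION_5YR:.0%} "
      f"to {staff_high / SUBSCRIPTION_5YR:.0%}")
print(f"Subscription plus 30%: ${SUBSCRIPTION_5YR * 1.30:,.0f}")
```

On these inputs, staff time alone runs from about $12,500 to $27,000 over five years, so the 30% uplift sits inside the range once a modest allowance for loaners and signage is added.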

How should a director frame this to a board?

The board-level framing isn't "we're saving money." It's "we're spending the same money on more coverage." Most directors we talk to don't want a smaller audio program for less money; they want a bigger one for the same money. The TCO argument is the means, not the end.

The frame that lands: for what we currently spend on English-only audio for one wing, we can have multilingual audio across the entire permanent collection, with same-day updates and a record of what visitors are asking about by gallery. That sentence is a board slide. The TCO model is the appendix that makes the slide credible.

Two adjacent framings that also work, depending on the room:

  • Accessibility and access framing. Five-year program cost held flat, but Spanish, Mandarin, French, and Korean visitors now have first-language interpretation across the whole collection. This is a mission line item, not a cost line item.
  • Curator-time framing. Curators spend less time chasing production cycles and more time on framing, judgment, and the long-term scholarship the institution actually hires them for. That's a quality argument that survives any cost discussion.

The TCO comparison is the entry point. The conversation it lets a director have with their board is the actual value.

FAQ

Various per-handset monthly numbers circulate as rules of thumb but none are sourced cleanly enough to defend in a board paper. Per-handset upkeep varies enormously by venue, uptake, and staff cost — and per-handset framing obscures the fact that staff hours, charging, and replacement scale with the program, not the device. Modeling at the program level (total fleet + staff + replacement) is more reliable than a per-device monthly number.

A small institution can run a defensible program at $15,000–$20,000 per year all-in once staff time and signage are counted, putting the five-year minimum near $75,000–$100,000. Below that, you're either trading off languages, depth, or update cadence — sometimes a fair trade for a small site, sometimes not.

Some legacy hardware vendors now offer subscriptions, and the offering is improving. A hardware-plus-content subscription from a legacy vendor typically lands between the two models we've shown: cheaper than the all-capex hardware path, still more expensive than a pure BYOD platform, and structurally locked to the vendor's hardware refresh cycle. It's worth modeling against your own assumptions if it's on the table.

Strictly speaking, a TCO model doesn't account for visitor experience; TCO is a cost model. But the comparison only makes sense if the visitor experience is broadly equivalent, which is why the hardware-wins exceptions above (named voice, never-refresh permanent, mid-contract) are about cases where the visitor experience isn't equivalent. For the median tour, a 2025-era phone-based program delivers an experience visitors prefer (their own headphones, their own language, their own pace), which makes the TCO comparison a fair one.

For a non-profit institution, treating the comparison in nominal dollars and showing the cumulative five-year total is usually clearer to a board than a discounted-cash-flow calculation. If your CFO wants a discount rate, the 3–5% range used for long-horizon institutional planning is defensible; it doesn't change the ranking of the two models for any plausible assumption set.
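If a CFO does ask for discounting, the check is short. The sketch below discounts the yearly totals from the two tables at 0%, 3%, and 5%. It assumes year-one costs are paid up front and each later year is discounted one further period, which is our timing convention rather than anything the tables specify.

```python
# Yearly totals from the two five-year tables (USD).
HARDWARE = [104_000, 17_500, 42_500, 41_500, 17_500]
PHONE = [14_400] * 5


def present_value(costs: list[int], rate: float) -> float:
    """Discount a stream of yearly costs back to today.

    Assumption: the first year's cost is paid now (not discounted) and
    each later year is discounted by one more period.
    """
    return sum(cost / (1 + rate) ** t for t, cost in enumerate(costs))


for rate in (0.0, 0.03, 0.05):
    hw = present_value(HARDWARE, rate)
    ph = present_value(PHONE, rate)
    print(f"rate {rate:.0%}: hardware ${hw:,.0f}, phone ${ph:,.0f}, "
          f"ratio {hw / ph:.2f}")
```

Because the hardware model is front-loaded, discounting trims its later years less than it trims the flat subscription in proportion, so the ranking holds at each of these rates.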

For tour operators and historic sites, the shape is the same; the magnitudes shift. Tour operators with seasonal volume and outdoor venues face higher hardware breakage rates and stronger BYOD adoption (visitors expect their phone outdoors). Historic sites with limited connectivity sometimes need offline-capable phone-based platforms, which is a real spec to check, but the TCO comparison still favors the phone model once you account for outdoor wear on handsets.

The verdict

For a mid-size museum planning five years ahead, the hardware-handset model is roughly three times more expensive than a phone-based platform delivering equivalent content. The gap is not the device fleet; in the model above, fleet purchase, fit-out, and partial replacement come to about a fifth of the cost. The gap is everything around the device: staff hours for distribution and sanitation, breakage and replacement, per-language scaling that grows linearly, and content refresh cycles that get postponed until they don't happen. The phone-based model collapses those line items into a software subscription whose true delta over the headline is the institution's own internal staff time.

The exceptions are real and worth taking seriously — named-voice programs, never-refresh permanent installations, and mid-contract hardware commitments. Outside those three, the TCO comparison answers itself, and the budget the institution would have spent owning a fleet is now available for what audio interpretation was supposed to do in the first place: more coverage, more languages, kept current.

If you want the rest of the buying-and-cost map before talking to vendors, read the pillar guide. If you're ready to look at numbers for your own institution, our pricing is published in full and the pilot tier is free.


About the author

Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the economics of museum interpretation from inside the category — drawing on RFP data, discovery calls with curators and directors, and the production economics of both the studio-and-handset model and the phone-based model. Reach him at eric@convo.app or on LinkedIn.
