WCAG — the Web Content Accessibility Guidelines — is the international standard most procurement teams now reference when they ask whether a museum audio guide is accessible. The current target for new builds is WCAG 2.1, Level AA, which the US Department of Justice formally adopted as the technical baseline for state and local government web content and mobile apps in its April 2024 Title II rule (ADA.gov, 2024). The compliance deadlines for covered public entities run through April 2027 and 2028 depending on population size, with a one-year extension issued in April 2026.
Private museums are not directly bound by Title II — that's Title III territory, and the DOJ hasn't yet published an equivalent technical rule for private accommodations. But in practice, every RFP we see from public-sector and major private institutions now writes WCAG 2.1 AA into the procurement language. If you're publishing a visitor-facing tour in 2026, it is the standard to design and procure against.
This piece walks through the WCAG criteria that actually matter for a phone-based, browser-based audio guide. It's organized by the WCAG principle (Perceivable, Operable, Understandable, Robust) and focused on the specific failures we've seen in shipped products. For the broader legal picture, see our ADA requirements piece. For the hearing-specific accommodations, see hearing accommodations.
Why does WCAG 2.1 AA matter for a museum audio guide web app?
Because it's both the legal floor for covered institutions and the procurement language private institutions are adopting voluntarily. The DOJ Title II rule names WCAG 2.1 AA as the technical standard for state and local government web content and mobile apps, with compliance deadlines of April 26, 2027 for entities serving 50,000 or more residents and April 26, 2028 for smaller entities and special districts (ADA.gov, 2024). That captures the majority of US public museums and historical sites either directly or through their parent government.
Beyond the law, the audience is real. More than one in four US adults has some kind of disability, with cognition at 13.9%, mobility at 12.2%, hearing at 6.2%, and vision at 5.5% (CDC, 2024). The share is materially higher in older cohorts, which overlap heavily with most museums' core visiting audience. Accessibility is not a small accommodation for an edge case; it's the experience for a meaningful slice of every gallery.
What does the BYOD model inherit from the visitor's device?
Most of the assistive technology stack — and that's the structural advantage of moving off rented handsets. A rented handset is whatever the vendor shipped: its screen reader is the one they built (or didn't), its font sizing is the one they wired up, its captioning is whatever the firmware supports. The visitor cannot bring their own. A BYOD web app inherits the operating system that the visitor has already configured for themselves: VoiceOver and Dynamic Type on iOS, TalkBack and Font size on Android, system-level captioning, system contrast settings, system-level voice control, and any third-party assistive tech they rely on at home.
The implication for the player you build: don't fight what the OS already does well. Use semantic HTML so VoiceOver can read it. Honor the user's font size preference instead of hard-coding a pixel value. Don't override the system caption rendering. The web app's job is to expose the right semantics; the OS's job is to deliver the experience.
Which WCAG success criteria matter most for an audio-guide web player?
Five clusters carry most of the weight: text alternatives, time-based media, distinguishable, keyboard and navigable, and name/role/value. The complete WCAG 2.1 AA set is 50 success criteria, and a well-built audio-guide app should aim to meet all of them. But the failures we see in shipped tours concentrate in a much smaller set. The rest of this piece covers each cluster with the specific implementation moves that satisfy the criteria.
A useful framing: about a third of the work is what you put into the content (alt text, transcripts, captions, plain language), about a third is what you build around the audio (player controls, keyboard support, focus management), and about a third is the visual layer (contrast, type, reflow). Skip any of the three and the tour fails for someone.
Text alternatives: alt text on every artwork image (1.1.1)
Every artwork image in the tour needs alt text that does the same job as the audio. WCAG 2.1, Success Criterion 1.1.1 (Non-text Content, Level A) requires that all non-text content presented to users has a text alternative serving the equivalent purpose. For a museum audio guide, that almost always means the artwork thumbnail at each stop, the cover image for each tour, the venue map, any decorative photography, and — easy to miss — the icons in the player controls themselves.
The default failure mode: empty alt="" on every artwork because no one wrote the descriptions, or alt text that says "image of a painting" and stops there. Neither is accessible. A useful pattern is to treat the alt text as a one-sentence visual description that the screen reader user gets in addition to the audio: the audio is the interpretation, the alt text is the picture. For a Vermeer that's "A young woman in a yellow jacket reading a letter by a window, light from the left," not "Painting." This is the place to spend the editorial energy.
Decorative imagery — patterns, borders, decorative backgrounds — should carry alt="" so screen readers skip it. Player-control icons need accessible names (covered under 4.1.2 below), not redundant alt text on the icon image itself.
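Both failure modes above can be flagged by a content check before a tour ships. The following is a minimal sketch, not a complete linter: the `TourImage` shape, the `decorative` flag, and the list of placeholder phrases are all assumptions made for illustration.

```typescript
// Hypothetical shape for an image record in a tour's content export.
interface TourImage {
  id: string;
  alt: string;
  decorative: boolean; // true for patterns, borders, backgrounds
}

type AltIssue = "missing" | "generic" | "decorative-has-alt";

// Assumed examples of placeholder alt text; extend for your own content.
const GENERIC_ALT = /^(image|picture|photo|painting)( of (a|an|the) \w+)?\.?$/i;

// Returns the problem with an image's alt text, or null if nothing is flagged.
function checkAlt(image: TourImage): AltIssue | null {
  const alt = image.alt.trim();
  if (image.decorative) {
    // Decorative images should carry alt="" so screen readers skip them.
    return alt === "" ? null : "decorative-has-alt";
  }
  if (alt === "") return "missing";
  if (GENERIC_ALT.test(alt)) return "generic";
  return null;
}
```

A check like this only catches the obvious cases. Whether a description that passes is actually a good picture of the artwork is still an editorial call.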
Time-based media: captions, transcripts, audio description (1.2.x)
The audio is the product. The captions and transcript are the audio for users who can't hear it. WCAG 2.1 covers time-based media across several criteria: 1.2.1 (alternatives for audio-only and video-only), 1.2.2 (Captions, Level A), 1.2.3 (Audio Description or Media Alternative, Level A), and 1.2.5 (Audio Description, Level AA).
For a museum audio guide, the practical ask is that every audio stop has a synchronized transcript that a deaf or hard-of-hearing visitor can read at the stop, with the playback position highlighted as the audio plays. The transcript is also the canonical source for the audio description: if a stop has visual content the audio doesn't fully describe (the painting, the room, the artifact), the transcript should include that visual description, and the description should be either narrated as part of the main audio or offered as an alternate "extended" track.
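Highlighting the playback position comes down to finding which transcript line is being spoken at the current time. A minimal sketch, assuming each transcript line is stored as a cue with a start time in seconds; the `Cue` shape and `activeCueIndex` name are hypothetical.

```typescript
// Hypothetical cue shape: one transcript line with its start time in seconds.
interface Cue {
  start: number;
  text: string;
}

// Returns the index of the cue being spoken at `time`, or -1 before the first cue.
// Assumes `cues` is sorted by ascending start time.
function activeCueIndex(cues: readonly Cue[], time: number): number {
  let lo = 0;
  let hi = cues.length - 1;
  let found = -1;
  // Binary search for the last cue whose start is at or before `time`.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cues[mid].start <= time) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

In a browser this would typically be called from the audio element's `timeupdate` handler with `audio.currentTime`, and the result used to move a highlight class onto the matching transcript line.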
A practical move that avoids most of the 1.2.x complexity: write the script with the visual description embedded ("To your left, a large canvas in oils — a young woman in a yellow jacket reading a letter…"). Then the same script is the audio, the transcript, and the audio description. One artifact, three formats, no separate workstreams. This is the pattern Smithsonian and several major museums have moved toward for new tours.
For the hearing-specific accommodations side of this — induction loops, ASL, captioning quality — see our hearing accommodations piece.
Distinguishable: contrast, resize, and reflow (1.4.x)
The web player has to be readable in the gallery, by every visitor, at the type size and contrast they set themselves. Four criteria carry most of the work here:
- 1.4.3 Contrast (Minimum) (AA): body text at a 4.5:1 contrast ratio against its background, large text (18pt+ or 14pt bold+) at 3:1. Museum galleries are often deliberately dim — a player that looks fine in your office can be unreadable next to a dimly-lit Rothko. Test in low light, not in the design tool.
- 1.4.4 Resize Text (AA): the player has to remain usable when the visitor sets system text to 200%. The failure mode is hard-coded pixel font sizes; the fix is rem-based or em-based typography that honors the system root font size. iOS Dynamic Type and Android Font size both flow through to the browser if you let them.
- 1.4.10 Reflow (AA): content reflows to a 320 CSS pixel viewport without horizontal scrolling. Most modern responsive layouts pass this by default; the failure mode is a fixed-width player or a transcript panel that overflows on narrow screens.
- 1.4.11 Non-text Contrast (AA): UI components and graphical objects (the play button, the progress bar, the language selector) need 3:1 contrast against their background. A ghosted-out play icon on a translucent overlay routinely fails this.
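The contrast thresholds in the list above can be checked in code using the relative-luminance formula published in WCAG 2.1. The sketch below assumes 6-digit hex colors with no transparency; the function names are our own, and a translucent overlay would need to be flattened to its rendered color first.

```typescript
// Parses "#rrggbb" into [r, g, b] channel values in 0..255.
function parseHex(hex: string): [number, number, number] {
  const m = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!m) throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  const n = parseInt(m[1], 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Relative luminance as defined in WCAG 2.1 (sRGB, 0.03928 threshold).
function relativeLuminance(hex: string): number {
  const [r, g, b] = parseHex(hex).map((c) => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// 1.4.3: 4.5:1 for body text, 3:1 for large text.
function meetsTextContrast(fg: string, bg: string, largeText = false): boolean {
  return contrastRatio(fg, bg) >= (largeText ? 3 : 4.5);
}
```

The same ratio applies to 1.4.11 with a 3:1 threshold for icons and control outlines. A passing number is a floor, not a substitute for looking at the screen in a dim gallery.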
The single highest-leverage move is to honor the system text size all the way through the design system. If your typography is built in rems and your spacing follows it, almost everything else falls into place.
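One way to keep pixel values out of the design system is to generate the type tokens in rems from the pixel sizes in the design file. This is a small sketch assuming a 16px design baseline, which is the usual browser default; `pxToRem` and `typeScale` are hypothetical names.

```typescript
// Assumed design baseline: 16px, the usual browser default root font size.
const BASE_PX = 16;

// Converts a design-tool pixel size into a rem string that scales with the
// visitor's root font size.
function pxToRem(px: number, basePx: number = BASE_PX): string {
  const rem = px / basePx;
  // Round to 4 decimal places and drop trailing zeros.
  return `${parseFloat(rem.toFixed(4))}rem`;
}

// Hypothetical type scale for a player UI.
const typeScale = {
  caption: pxToRem(14),
  body: pxToRem(16),
  title: pxToRem(24),
};
```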
Operable: full keyboard support and a sane focus order (2.1, 2.4)
Every control has to work without a mouse or touch, and the focus order has to make sense. Success Criterion 2.1.1 (Keyboard, Level A) requires that all functionality is operable through a keyboard interface. Success Criterion 2.4.3 (Focus Order, Level A) requires that the order in which components receive focus preserves meaning and operability. Success Criterion 2.4.7 (Focus Visible, Level AA) requires a visible focus indicator.
For a web player in a museum, the keyboard-user population is small but not zero — and more importantly, switch-control users and voice-control users (Voice Control on iOS, Voice Access on Android) drive the page through the same accessibility tree. If your custom play button can't be activated by keyboard, it can't be activated by voice either.
The recurring failure modes in audio-guide players we've audited:
- Custom play/pause buttons built with <div> and an onClick handler. A <button> element is keyboard-accessible by default; a <div> is not. Use the native element.
- Progress bars built with mouse-only drag handlers. A real <input type="range"> gets arrow-key support, screen-reader announcement, and voice control for free.
- Modal "tip" overlays that trap focus and have no escape. A modal should trap focus while open and return focus to the trigger element on close. The "trap with no escape" is a literal WCAG fail.
- "Skip 15 seconds" controls with no keyboard binding. Add tabindex="0" and a key handler, or use a real button.
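For the skip control, the "use a real button" route can be sketched as follows. Because a native button fires `click` for keyboard and voice activation as well as touch, one click listener is enough and no separate key handler is needed. The `Clickable` and `Seekable` interfaces are structural stand-ins we made up so the logic can run outside a browser; in a page the arguments would be a real <button> and the audio element.

```typescript
// Stand-in for a native <button>: only the click subscription is needed.
interface Clickable {
  addEventListener(type: "click", listener: () => void): void;
}

// Stand-in for the audio element's seek-related properties.
interface Seekable {
  currentTime: number;
  duration: number; // NaN until the audio metadata has loaded
}

// Keeps a seek target inside the playable range.
function clampSeek(target: number, duration: number): number {
  const floor = Math.max(target, 0);
  // Before metadata loads the duration is NaN, so only clamp at zero.
  return isFinite(duration) ? Math.min(floor, duration) : floor;
}

// Wires a skip button; pass a negative delta for "back 15 seconds".
function wireSkipButton(button: Clickable, audio: Seekable, deltaSeconds: number): void {
  button.addEventListener("click", () => {
    audio.currentTime = clampSeek(audio.currentTime + deltaSeconds, audio.duration);
  });
}
```

The button still needs a visible or accessible name such as "Skip forward 15 seconds"; that side is covered under 4.1.2.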
Understandable: language, consistency, and plain interpretation (3.x)
Tell the browser what language each piece of content is in, keep the interface consistent across the tour, and write the interpretation at a reading level a general audience can follow. The relevant criteria here:
- 3.1.1 Language of Page (Level A): set the lang attribute on the document. For a multilingual tour, the page lang should match the active tour language; per-passage lang attributes on individual quotes or terms in other languages let screen readers pronounce them correctly.
- 3.2.3 Consistent Navigation (Level AA): the player controls, stop navigation, and language switcher should appear in the same place across every stop. Tour-specific custom layouts that move the play button around fail this.
- 3.3.1 Error Identification (Level A): when something goes wrong — a stop won't load, a language download failed, the offline cache is full — say so in text the user can read, not just an icon or a color change.
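The per-passage lang attributes from 3.1.1 above can be generated from the content model rather than hand-written. A minimal sketch, assuming transcript text is stored as segments that carry a language tag only when they differ from the tour language; the `Segment` shape and `renderPassage` name are hypothetical.

```typescript
// Hypothetical content model: a passage is a list of segments, each with an
// optional language tag when it differs from the tour language.
interface Segment {
  text: string;
  lang?: string;
}

// Escapes text for safe use in HTML content and attribute values.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wraps only the segments whose language differs from the page language.
function renderPassage(segments: readonly Segment[], pageLang: string): string {
  return segments
    .map((seg) => {
      const text = escapeHtml(seg.text);
      if (!seg.lang || seg.lang === pageLang) return text;
      return `<span lang="${escapeHtml(seg.lang)}">${text}</span>`;
    })
    .join("");
}
```

The page-level attribute is a separate step: when the visitor switches tour language, set `document.documentElement.lang` to the new language so the two stay in sync.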
There's also a content-side reading: WCAG doesn't dictate a specific reading level for AA, but plain-language interpretation written at roughly an 8th–10th grade level serves cognitive accessibility, English-as-a-second-language visitors, and general engagement at the same time. This is the place where the multilingual posture pays a second dividend — see our multilingual interpretation pillar.
Robust: name, role, and value on every interactive element (4.1.2)
Every player control needs an accessible name, an explicit role, and a state the screen reader can read. WCAG 4.1.2 (Name, Role, Value, Level A) is the criterion that catches custom UI. A native <button> with text inside it satisfies the criterion automatically. A <div role="button" aria-label="Play"> can satisfy it if done correctly. A <div onClick> with an SVG icon and no other markup does not.
The audit pattern: open the page in a screen reader (VoiceOver on Mac with Safari, or NVDA on Windows with Firefox), tab through every interactive element on the player, and confirm three things for each: it announces a clear name ("Play", not "button"), it announces its role ("button", "slider", "menu"), and when its state changes (playing → paused, language English → Spanish) the change is announced. WCAG 4.1.3 (Status Messages, Level AA) extends this to ephemeral status: "Loading", "Audio unavailable", "Offline".
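For the ephemeral status case, a common approach is a single live region that is already in the page, for example an element with role="status", whose text is updated when the player state changes. The sketch below assumes that region exists; the `TextTarget` interface, the status names, and the "ready" state are hypothetical, and the other wording follows the examples above.

```typescript
// Stand-in for the live-region element; in a page this would be an element
// such as <div role="status"> that exists before any message is written.
interface TextTarget {
  textContent: string | null;
}

type PlayerStatus = "loading" | "unavailable" | "offline" | "ready";

// Human-readable text for each status, so state is never icon-only.
const STATUS_TEXT: Record<PlayerStatus, string> = {
  loading: "Loading",
  unavailable: "Audio unavailable",
  offline: "Offline",
  ready: "", // clearing the region announces nothing
};

// Writes the message into the live region and returns it for visible UI use.
function announceStatus(region: TextTarget, status: PlayerStatus): string {
  const message = STATUS_TEXT[status];
  region.textContent = message;
  return message;
}
```

Whether a given screen reader actually speaks the update is exactly the kind of thing the manual walkthrough has to confirm.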
This is the criterion where the custom-audio-player anti-pattern shows up most often. The reusable rule: if a control isn't a real <button>, <input>, <select>, or <a>, every attribute the browser would have given it for free now has to be added by hand — and tested with a real screen reader, not an automated scanner.
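To make "added by hand" concrete, here is roughly what a custom play/pause toggle has to supply when it is not a native button. This is a sketch under our own naming (`playToggleAttrs`, `isActivationKey`), and it covers only the name, role, focusability, and activation keys; a native <button> provides all of these without any of this code.

```typescript
type AttrMap = Record<string, string>;

// Attributes a custom (non-<button>) play/pause toggle needs to add by hand.
function playToggleAttrs(isPlaying: boolean): AttrMap {
  return {
    role: "button", // role: announced as a button
    tabindex: "0", // reachable in the focus order
    "aria-label": isPlaying ? "Pause" : "Play", // name reflects the action
  };
}

// Native buttons activate on Enter and Space; a custom one must match.
function isActivationKey(key: string): boolean {
  return key === "Enter" || key === " ";
}
```

The attributes have to be reapplied whenever playback state changes, and the key handler has to call the same code path as the click handler.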
Where automated scanners help, and where they don't
Automated tools catch roughly 30–40% of WCAG issues; the rest require manual testing. axe-core, Lighthouse, and WAVE are useful and worth running on every release. They reliably catch missing alt text, low contrast on standard text, missing form labels, and basic ARIA misuse. They do not catch: whether your alt text is meaningful, whether the focus order makes sense, whether a modal trap returns focus correctly, whether the screen reader actually reads your custom player in a way a blind visitor can use, or whether the captions are synchronized with the audio.
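For the per-release scan, it helps to reduce the scanner output to a pass/fail signal. The sketch below works on a local mirror of the fields we rely on from an axe-core result (`violations`, each with `id`, `impact`, and `nodes`); the interfaces are declared here rather than imported, and the choice of which impact levels block a release is an assumption to adjust.

```typescript
// Local mirror of the parts of a scan result used here.
interface Violation {
  id: string;
  impact?: string | null; // "minor" | "moderate" | "serious" | "critical"
  nodes: unknown[];
}
interface ScanResult {
  violations: Violation[];
}

// Counts affected elements per impact level.
function countByImpact(result: ScanResult): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const v of result.violations) {
    const impact = v.impact ?? "unknown";
    counts[impact] = (counts[impact] ?? 0) + v.nodes.length;
  }
  return counts;
}

// True if any violation is at a level the team has agreed blocks a release.
function hasBlockingViolations(
  result: ScanResult,
  blocking: readonly string[] = ["critical", "serious"],
): boolean {
  return result.violations.some(
    (v) => v.impact != null && blocking.indexOf(v.impact) !== -1,
  );
}
```

A clean result here means only that the automated subset passed; it says nothing about the manual items listed above.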
The minimum cadence we'd recommend for a production tour: an automated scan on every release, a manual keyboard-only walkthrough quarterly, and a screen-reader walkthrough with VoiceOver-on-Safari and TalkBack-on-Chrome twice a year. For procurement evaluations, ask the vendor for their most recent VPAT (Voluntary Product Accessibility Template) and the date of their last manual audit, not just their Lighthouse score.
What about WCAG 2.2 and the upcoming 3.0?
Design to 2.1 AA for now; add the small set of 2.2 additions where they're cheap; don't wait for 3.0. WCAG 2.2 was published in October 2023 and adds nine new success criteria, including focus appearance, dragging movements, target size, consistent help, and redundant entry. None of them break a 2.1-conformant design; most are common-sense extensions. The DOJ rule names 2.1 AA explicitly, but procurement teams are increasingly asking about 2.2 conformance as well — it's worth tracking.
WCAG 3.0 ("Silver") is still a working draft and will not be the legal standard for years. It's a fundamental restructuring rather than an incremental update. Design to 2.1 AA today, add 2.2 where the cost is low, and follow 3.0 in the background.
Where this doesn't fully work
The honest section. A BYOD web app inherits enormous accessibility surface for free — but it doesn't solve everything, and pretending otherwise is the failure mode this pillar exists to push back against.
Visitors who don't carry a smartphone, can't operate one comfortably, or run an older device that the OS no longer updates with current accessibility features need an accommodation strategy that doesn't depend on their device. That's where loaner phones, large-print transcripts at the front desk, ASL tour bookings, and trained docents come in. A web player is the default, not the universal answer; for the cases it doesn't cover, see our ADA requirements and hearing accommodations pieces.
The other thing a web app can't do alone is fix the institution's content posture. If the underlying interpretation isn't written with visual description, isn't available as a transcript, and isn't paced for cognitive accessibility, no amount of WCAG compliance on the player makes the tour accessible. The technical floor is necessary; it isn't sufficient.
What's the right next step?
If you're writing accessibility requirements into an RFP, name WCAG 2.1 AA as the technical standard, ask vendors for their most recent VPAT, and ask specifically about the player controls — custom audio UI is where almost every shipped tour quietly fails. If you're auditing an existing tour, start with a keyboard-only walkthrough of the player and a VoiceOver walkthrough of the stop screen; those two surface most of the high-severity issues in under an hour.
For the full accessibility picture — ADA, hearing accommodations, audio description, and the editorial side of accessible interpretation — see the accessibility pillar guide. For how this fits the broader category of phone-based audio guides, see the AI audio guides pillar.
About the author
Eric Duffy is the founder of Convo, a platform that lets museums and cultural institutions publish multilingual audio tours their visitors can have a conversation with. He writes about the technical and editorial work of accessible museum interpretation from inside the category, drawing on procurement language across the sector, WCAG audit work on the Convo web app, and conversations with accessibility leads at museums of every size. Reach him at eric@convo.app or on LinkedIn.