Captions and Transcripts

Video needs captions for deaf and hard-of-hearing users, and both video and audio need transcripts for anyone who cannot watch or listen to the content in its original form.

Audio and video content excludes users who cannot hear it — unless the same information is available in a form they can access.

Captions versus subtitles

Captions and subtitles are not the same. Subtitles translate spoken dialogue into text for users who do not speak the language. Captions provide all audio content, including dialogue, speaker identification, sound effects, and music descriptions, for users who cannot hear the audio. WCAG requires captions for all pre-recorded synchronised media (video with an audio track) at Level A (Success Criterion 1.2.2). For live content, real-time captions are required at AA conformance.

What captions must include

Adequate captions transcribe all speech accurately, identify speakers when there are more than one, and describe meaningful non-speech sounds: [music playing], [door slams], [laughter]. They must be synchronised with the audio — appearing at the right moment and staying on screen long enough to read. Auto-generated captions are not sufficient without review — they fail on accents, technical terms, proper nouns, and rapid speech. They are a starting point, not a finished product.
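As a rough illustration of these requirements, the sketch below writes a caption file in WebVTT, a common caption format for web video, and flags cues that are on screen too briefly to read. The Cue class, the helper names, and the sample dialogue are hypothetical, and the 20 characters-per-second ceiling is an assumed editorial threshold, not a WCAG figure.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float        # seconds from the start of the media
    end: float          # seconds from the start of the media
    text: str           # speech, or a bracketed sound such as "[door slams]"
    speaker: str = ""   # left empty for non-speech sounds


def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues: list[Cue]) -> str:
    """Render cues as a WebVTT document, using voice tags to identify speakers."""
    blocks = ["WEBVTT"]
    for cue in cues:
        body = f"<v {cue.speaker}>{cue.text}" if cue.speaker else cue.text
        blocks.append(f"{vtt_time(cue.start)} --> {vtt_time(cue.end)}\n{body}")
    return "\n\n".join(blocks) + "\n"


def too_fast(cues: list[Cue], max_chars_per_second: float = 20.0) -> list[Cue]:
    """Return cues that are displayed too briefly for their length.

    The default ceiling is an assumption for illustration, not a WCAG value.
    """
    flagged = []
    for cue in cues:
        duration = cue.end - cue.start
        if duration <= 0 or len(cue.text) / duration > max_chars_per_second:
            flagged.append(cue)
    return flagged


# Hypothetical sample: two speakers and one meaningful non-speech sound.
cues = [
    Cue(0.0, 2.5, "Welcome back to the show.", "Host"),
    Cue(2.5, 4.0, "[door slams]"),
    Cue(4.0, 4.5, "Sorry I'm late, the train was delayed again.", "Guest"),
]

print(to_webvtt(cues))
for cue in too_fast(cues):
    print("Too fast to read:", cue.text)
```

In this sample only the last cue is flagged, because a full sentence is given half a second on screen. A check like this catches timing problems only; accuracy on names, accents, and technical terms still needs a human reviewer.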

Transcripts

A transcript is a text document containing all of the audio content of a video or audio file. For audio-only content (podcasts, voice recordings) a transcript is the primary accessibility mechanism — captions are not applicable without video. For video content, a transcript provides an alternative that users can read at their own pace, search, copy, or access when they cannot watch video at all. WCAG requires transcripts for pre-recorded audio-only content.

Audio descriptions

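Audio description, a narrated description of visual content that the audio track does not convey, is required for pre-recorded video. A diagram discussed on screen but never described in the audio track is inaccessible to blind users. At Level A (Success Criterion 1.2.3), either an audio description track or a text transcript that includes visual descriptions satisfies the requirement; at Level AA (1.2.5), an audio description track is required. Extended audio description, which pauses the video to make room for longer descriptions, is a Level AAA technique (1.2.7).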

Real-time captions

For live content such as live events, webinars, live streams, and real-time support, pre-prepared captions are not possible. WCAG 2.1 AA (Success Criterion 1.2.4) requires real-time captions for all synchronised live media. Automated speech recognition (ASR) services can be accurate enough for most speakers in controlled environments, but accuracy drops for accented speech, technical vocabulary, and noisy conditions.

Live captions should: appear with minimal delay (under 5 seconds), remain on screen long enough to read, identify speakers when there are multiple, and be made available as a reviewed transcript after the event — the raw ASR output is a starting point, not a finished record.
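The delay target can be checked after an event if a log of when words were spoken and when their captions appeared is available. The sketch below assumes such a log exists; the LiveCaption record, its field names, and the sample values are hypothetical, and the 5-second limit is the figure given above.

```python
from dataclasses import dataclass


@dataclass
class LiveCaption:
    spoken_at: float   # seconds into the stream when the words were spoken
    shown_at: float    # seconds into the stream when the caption appeared
    text: str


def late_captions(captions: list[LiveCaption], max_delay: float = 5.0) -> list[LiveCaption]:
    """Return captions that appeared more than max_delay seconds after the speech."""
    return [c for c in captions if c.shown_at - c.spoken_at > max_delay]


def average_delay(captions: list[LiveCaption]) -> float:
    """Mean delay in seconds between speech and caption; 0.0 for an empty log."""
    if not captions:
        return 0.0
    return sum(c.shown_at - c.spoken_at for c in captions) / len(captions)


# Hypothetical log from a webinar recording.
log = [
    LiveCaption(10.0, 12.0, "Good morning, everyone."),
    LiveCaption(20.0, 27.5, "Today we cover the release schedule."),
    LiveCaption(30.0, 33.0, "Questions go in the chat."),
]

print(f"Average delay: {average_delay(log):.1f}s")
for caption in late_captions(log):
    print("Late caption:", caption.text)
```

A report like this is useful for comparing captioning providers or spotting segments where the captioner fell behind, which are also the segments most worth checking when the transcript is reviewed.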

Sign language interpretation

Sign language interpretation provides access for deaf users whose primary language is a sign language rather than a written form of the spoken language. BSL (British Sign Language) and ASL (American Sign Language) are distinct languages, not signed versions of English — a BSL user may not read English fluently, making captions alone insufficient.

WCAG AAA (Success Criterion 1.2.6) requires sign language interpretation for pre-recorded synchronised media. For organisations serving communities where sign language is a primary language — public sector, healthcare, legal services — it is a practical necessity regardless of conformance level. Implementation options: a picture-in-picture interpreter in the video corner, or a separate interpreter video provided alongside the primary content.

Transcript structure and formatting

A transcript is not a raw dump of speech. A well-formatted transcript serves users who are reading rather than listening or watching, and that reading experience has its own requirements.

Speaker identification — when there are multiple speakers, each speaking turn must be labelled. The format should be consistent: “SPEAKER NAME: content” or “[Speaker Name]” on its own line before their text. This is especially important for interview or conversation formats where speaker changes are frequent.

Timestamps — include timestamps at regular intervals (every 1–2 minutes, or at each new topic) so users can jump to a point in the audio or video that corresponds to content they are interested in. Timestamps also allow users to verify that the transcript matches the content at a specific point.

Non-speech content — describe meaningful non-speech sounds inline: [applause], [music fades], [slide change]. Describe visual content that is discussed but not narrated: [speaker shows a diagram of a grid system]. These visual descriptions are what make a transcript a usable alternative for blind users as well as deaf users.

Paragraph breaks — break long speaking runs into paragraphs at natural topic transitions. A wall of text with no paragraph structure is harder to read and harder to search.

Corrections vs raw output — auto-generated transcripts are a starting point. They regularly fail on proper nouns, technical terms, accents, and crosstalk. Every transcript should be reviewed and corrected before publishing. The correction burden is the strongest argument for budgeting transcript production into the content workflow, not treating it as an afterthought.
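The formatting rules above can be sketched as a small script that turns reviewed speaking turns into a readable transcript. The Turn class, the function names, and the sample content are hypothetical, and the two-minute timestamp interval is one choice within the range suggested above.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    start: float    # seconds from the start of the recording
    speaker: str    # empty string for non-speech or visual descriptions
    text: str


def clock(seconds: float) -> str:
    """Format seconds as a bracketed [HH:MM:SS] transcript timestamp."""
    whole = int(seconds)
    return f"[{whole // 3600:02d}:{whole % 3600 // 60:02d}:{whole % 60:02d}]"


def format_transcript(turns: list[Turn], stamp_every: float = 120.0) -> str:
    """Render turns as paragraphs with speaker labels and periodic timestamps."""
    paragraphs = []
    next_stamp = 0.0
    for turn in turns:
        if turn.start >= next_stamp:
            paragraphs.append(clock(turn.start))
            # Schedule the next timestamp one interval after this one.
            next_stamp = turn.start + stamp_every
        # Label speech as "SPEAKER NAME: content"; leave descriptions unlabelled.
        label = f"{turn.speaker.upper()}: " if turn.speaker else ""
        paragraphs.append(label + turn.text)
    return "\n\n".join(paragraphs) + "\n"


# Hypothetical reviewed turns from a recorded talk.
turns = [
    Turn(0.0, "Maya", "Welcome to the session."),
    Turn(4.0, "", "[applause]"),
    Turn(9.0, "Tom", "Thanks. Let's start with the layout."),
    Turn(130.0, "", "[speaker shows a diagram of a grid system]"),
    Turn(135.0, "Tom", "Each column here is the same width."),
]

print(format_transcript(turns))
```

Each turn becomes its own paragraph, so long monologues would still need to be split at topic transitions by whoever reviews the text. The script handles layout only; it assumes the turns have already been corrected by a person.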

The takeaway

Every piece of video content you publish needs synchronised captions. Every audio-only file needs a transcript. Budget this into production, not post-production. Auto-generated captions should always be reviewed and corrected before publishing.