December 23, 2010

<video>, Accessibility and HTML5 Today

Back in July of 2009, I wrote a blog post spurred on by a dinner conversation with my friend Bruce Lawson. Since then, I’ve seen a few instances where people have pointed to that posting as important to understanding the issue of accessibility and video in HTML5. A lot has changed, however, since I wrote that piece, and I’ve been meaning to update that information for some time now. A recent email thread amongst some friends crystallized that requirement, and the following is adapted from the email note I wrote to that thread.

Accessibility in Video and Audio is tricky – we are dealing with multi-media content here, so ‘full’ accessibility becomes significantly more complex, as ‘accessibility’ for one user/user-group is not the same as the requirements for all user-groups: obvious when stated, but best kept in mind as we discuss media in HTML5. The following is *my* understanding of where things currently stand.

User Needs:

The media sub-team of the Accessibility Task Force for HTML5 (of which I am the Co-Chair) has (hopefully) identified all of the user needs we could come up with, and it is an extensive list. The following URLs document that progress:

This then led us to ‘group’ these needs into two basic constituencies:

See also:

It is important to note that this work is a collection of both author requirements and player (browser/user-agent) requirements. We have asked on multiple occasions for further comments (to ensure we’ve not missed anything) with no further feedback, so at this time it is presumed ‘complete’; however, please let me know if you believe something is missing.

It is also important to underscore that “fallback” for <video> and <audio> is not the same as ensuring ‘accessibility’ for video or audio – it is simply intended for legacy browsers that do not support HTML5’s new elements. The best that we can say (when ‘teaching’ about this) is that it is similar to the old <noscript> construct (minus the need for an element), or the ‘fallback’ we once provided for Frames (and please, not “Your browser sucks, get a better browser”). While this fallback should be informative and ‘accessible’ – it is not intended to meet accessibility requirements or needs.
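As a rough illustration of the distinction (the file name and wording here are placeholders of my own invention, not from any spec), fallback content for legacy browsers might look something like this:

```html
<!-- Browsers that do not understand <video> ignore the element
     itself and render the inner content instead -->
<video src="lecture.webm" controls>
  <p>
    Your browser does not appear to support HTML5 video.
    You can <a href="lecture.webm">download the video file</a>
    and watch it in a media player of your choice.
  </p>
</video>
```

The fallback paragraph is informative and ‘accessible’ in the general sense, but it is not what delivers captions, subtitles or descriptions to the users who need them.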

The <track> Element:

The <track> element, as a child element of <video> and <audio>, is the means by which authors can specify alternatives – or, more correctly, supporting content – for the multi-media content. In the sub-team, we have taken to informally referring to the ‘video’ as the Primary Media Resource, with the alternative content being known as the Alternative Media Resource.

For now, <track> is the way we can reference time-stamped texts [I will return to that in a bit] that could/would include captions, but also sub-titles (i18n), extended text descriptions, and other potential timed text files; <track> ‘inserts’ the supplemental content into the DOM tree as child elements of <video>/<audio>. (If you have ever worked with Flash-based video, this is a similar authoring pattern to the <param> elements used in that environment.)

The <track> element takes the following attributes: @src, @kind (caption, subtitle, etc.), @srclang, @charset, and @label, with the @kind attribute being the most important for accessibility needs. It is unclear at this time whether <track> will/could also reference supplemental multi-media content such as audio-described content, extended audio descriptions, ‘picture-in-picture’ sign language files, etc. – this is on our current working agenda to be resolved.
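A hand-authored example might look like the following (the file names are invented for illustration, and – given that the spec is still in flux – the attribute details should be treated as a sketch rather than gospel):

```html
<video src="lecture.webm" controls>
  <!-- Captions for users who are deaf or hard of hearing -->
  <track src="lecture-captions-en.srt" kind="captions"
         srclang="en" label="English Captions">
  <!-- Subtitles as translation (i18n) -->
  <track src="lecture-subtitles-fr.srt" kind="subtitles"
         srclang="fr" label="Sous-titres français">
  <!-- Text descriptions of important visual information -->
  <track src="lecture-descriptions-en.srt" kind="descriptions"
         srclang="en" label="English Descriptions">
</video>
```

Here the @kind value is what tells the user agent which accessibility role each file plays.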

For those who may already know this, my apologies, but a quick primer of the current HTML5 media formats is important at this time.

We currently have 3 media formats that are being discussed: MP4, OGV, and WebM. These are ‘wrapper’ formats that contain the encoded videos (using H.264 for MP4, Theora for OGV, and VP8 for WebM), but inside those wrapper formats other ‘content’ can be enclosed, including those timed-text files, meta-data files, etc. We can already do this today, and this is in fact how captioned video for iPhones, iPads, etc. is currently provided: the ‘bundling’ is a post-production process done in tools such as Final Cut Pro, QuickTime Pro, etc.

For video content that is ‘complete’ in this post-production process there is an API which “looks” inside the wrapper, and extracts/maps the bundled supplemental content to the <track> DOM node(s). For accessibility, this is also the ‘better’ authoring practice, as it ensures that the supplemental content remains bundled with the video (when re-purposed or shared by other sites). However, since this post-production is not always viable for all content authors (in part because there are not a lot of tools that make this simple to do today), actually authoring <track src=> (etc., etc.) is the means to associate the supplemental content as child elements of <video>, as it is being ‘hand-authored’ into the DOM. (The reason this is less optimal is that secondary users might ‘capture’ the video file, but not bother to capture the supplemental files when “copy/pasting”, thus degrading the general accessibility of those media files).

At this time, it is envisioned that a ‘menu’ of all track content would be exposable to the end user in a fashion *similar to* an unordered menu list: again, since the <track>s are child elements of <video> (and <audio>) this appears fairly uncontroversial, although it is not yet implemented in any browser. (One possible solution is a focusable ‘drop-down’ included in the video controls, along with the basic start, stop and volume controls.) There are already a few examples authored by Silvia Pfeiffer (under contract to Mozilla) that illustrate this method. (Note: as this is a Proof-of-Concept example, and Silvia produced it under contract to Mozilla, it works best in Firefox.)
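Because the <track> elements are ordinary children in the DOM, a script can already enumerate them to build such a menu by hand today. A minimal sketch of the idea (my own, not taken from any browser implementation or from Silvia’s examples; file names and labels are invented):

```html
<video src="lecture.webm" controls>
  <track src="captions-en.srt" kind="captions" srclang="en"
         label="English Captions">
  <track src="subtitles-fr.srt" kind="subtitles" srclang="fr"
         label="Sous-titres français">
</video>
<script>
  // Walk the <track> children of the video and list each one's
  // label and kind - a custom control could present this list
  // as a focusable, selectable menu next to the other controls.
  var video = document.querySelector('video');
  var tracks = video.querySelectorAll('track');
  var menu = document.createElement('ul');
  for (var i = 0; i < tracks.length; i++) {
    var item = document.createElement('li');
    item.textContent = tracks[i].getAttribute('label') +
                       ' (' + tracks[i].getAttribute('kind') + ')';
    menu.appendChild(item);
  }
  video.parentNode.appendChild(menu);
</script>
```

Native browser support would presumably surface the same information through the built-in controls, rather than requiring script like this.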

Timestamp Formats (WebSRT, TTML, etc.):

Just as the codec ‘issue’ is still working itself out, so too, it appears, is the time-stamp format issue. I have long predicted that this was going to be the trickiest issue we would face, and sadly it appears I was correct. We currently have 2 main formats being contemplated, and each has its strengths and weaknesses: no one format is deemed ‘complete’ as measured against our User Requirements – another exercise we undertook and have reported:

This is further complicated by the fact that the browsers today (for the most part) are leaning towards WebSRT (a viable candidate if it can be modified to fill the existing gaps identified), while commercial content producers are already moving towards a profile of TTML called SMPTE-TT. There is also a suggestion (currently being alluded to by Microsoft) that browsers could/should support more than one time-stamp format.
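For those unfamiliar with the format, WebSRT builds on the simple SRT cue structure – an optional identifier, a timing line, and the cue text – with proposed extensions (cue settings, inline ‘voice’ and styling markup) layered on top. A basic cue looks like this (the cue text is my own example; note that WebSRT uses a period before the milliseconds where classic SRT uses a comma):

```
1
00:00:12.000 --> 00:00:15.500
[John] Welcome, and thanks for joining us today.
```

TTML, by contrast, is an XML vocabulary, which is part of why the two camps’ tooling and preferences diverge so sharply.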

As a side-bar, I have a particular frustration with the Society of Motion Picture and Television Engineers (SMPTE) for not being involved in our discussions even though they were likely aware of them, and, frustratingly, the full SMPTE-TT spec is not ‘freely’ available – they are charging $75.00 for a copy. It is both arrogant and self-restricting, as it further distances their efforts from the non-commercial communities of web video producers. So much for an open web!

The status of all of this is very much in rapid flux, so today I can’t state which format (if any) will be included as a base-line format in HTML5 – it may in fact be punted altogether in a fashion similar to the codec ‘issue’ – hardly the best option, but one that still exists. I am working with others to strive to ensure that this is not the case, but fear that lines are hardening here as well… (One important consideration at this time is that WebSRT is not a W3C technology per se, although there is no reason that this could not change; in fact, there is some tentative talk of chartering a new Working Group to do something like this – however, it is too early to state whether this will in fact happen, and there might be some whining about it from some WHAT WGers along the way.)

Other Outstanding Issues:

At this time there are a few other issues that we are aware of, but have not teased out fully. They mostly center around support for extended descriptions, whether audio or text based. The questions mostly surround the controlling time-line (to which all other content is synced) and a means of ‘pausing’ the video to allow for the delivery of this described content. There are a few ideas floating around, but further discussion is ongoing at this time. There is also some discussion on whether or not media queries may have a role to play when offering up the appropriate supplemental files, with no clear consensus here either.

Content navigation by content structure is also an issue which requires more work. While the current spec suggests a means to navigate content via “Chapters” (a decidedly WebSRT concept), this construct is but one level deep. The need for a hierarchical navigation means (think H1, H2, H3) has been identified. While this type of richer structure can be achieved with TTML, it has not been fully explored as to how this would/could work in HTML5, and it might require modifications to the basic JavaScript API used for those fully ‘wrapped’ media files.

As well, I am working on a Change Proposal (due early January) that would give the video ‘poster’ the ability to take an @alt value that would be different from a textual description of the media file – there is some push-back on this topic from some of the engineers, but I believe, based upon other feedback, that the engineers do not fully understand the real issue. Mark this one as a giant question mark.

Finally, in the near future, work will need to focus on creating good author guidance and instruction, not only for mainstream publication, but also written in a way so that it can be added to the WCAG Techniques For Success documentation. I hope to be able to help with this work as well, which I suspect will need to start being addressed by this summer at the latest.

  1. #1 by bruce on December 29, 2010 - 4:34 am

  2. #2 by John on December 29, 2010 - 12:26 pm

    Ho Ho Ho (indeed)!

    It is a sign of how quickly the landscape is changing that this is of little surprise. If the changes Hixie has added to WebVTT seek to address the gaps identified in WebSRT then it can be a good thing. I personally hope that this work could move into W3C space, where governments, education and other non-web industries look for ‘endorsement’ of web standards and specs.

    Likewise, if SMPTE-TT has addressed outstanding issues then we will have a real horse race: again, opening this spec to free and open public inspection would go a long way to answering the gap questions – I guess we will be looking more closely at both in the New Year.

    Stand by for more details.

  3. #3 by Sean on January 5, 2011 - 2:48 pm

John, in case you weren’t aware, I was the editor of the SMPTE-TT spec and a member of the TF too, so there has been some cross-fertilisation. Note that SMPTE is based on individual membership, though, and not organisational membership, so you could have joined in; it also has non-disclosure rules that prevent discussion of standards under development.

    I too wish the SMPTE spec were freely available, unfortunately life is not like that.

  4. #4 by John on January 5, 2011 - 2:56 pm

It would seem to me that if SMPTE wants their work to be broadly accepted and supported by all of the browsers and content producers, they should be freely sharing it with any and all. Keeping it under lock and key will simply result in little-to-no take-up in the broader community. If the WHAT WG has taught us anything, it’s that any spec without broad take-up and implementation is just another volume of work on the bookshelf.

Despite my very focused attention to this topic, I personally am loath to spend $75.00 of my hard-earned dollars on a spec that may never really get broad browser support. It’s not like they have to kill any trees or bind and ship this; a digital download costs the owners nothing but the most minimal of bandwidth.

  5. #5 by Sean on January 5, 2011 - 4:06 pm

    I’ll certainly forward that feedback to the SMPTE team.

  6. #6 by Terrill Thompson on February 1, 2011 - 3:19 pm

    John and others,

    What’s the future vision as to how human-recorded descriptive video will work in HTML5? In the current draft of the spec, if a track element’s @kind attribute = “descriptions”, that constitutes a text description intended for audio synthesis. Is the hope that kind=”descriptions” will ultimately be expanded to support either text-based or human-recorded description, and if so, what additional markup would be required to make it work?

Also, I assume human-recorded description will encounter the same audio format hurdles that plague the audio element. If that’s the case, will we need to provide description in two or more file formats, and reference each in separate track elements?

I realize this is all in perpetual motion. Just wondering if there’s any clarity yet within the Accessibility Task Force as to what it might look like.


  7. #7 by John on February 1, 2011 - 3:27 pm

@ Terrill: The media sub-team has not yet fully grappled with how we will associate binary data (audio descriptions, but also P-in-P sign language, etc.) with the main video – a few ideas have been tossed around, but no clear path forward as yet: we are, in fact, starting work on this right now. (Related to this is also how to ensure these files remain synced.) It is quite conceivable that the @kind attribute would serve this need as well, although again, we’ve not yet hit that point to say so definitively.

As for encoding format issues, yes, the same problems that exist for the primary media asset will also come into play for supporting assets: the roles that the different bits and pieces play will be different, but the encoding issues remain regardless of the role of any one asset.

  8. #8 by Joshua Kinberg on June 21, 2011 - 2:29 pm

    The SMPTE-TT spec is now freely available:
