SOAP - Archived

October 21, 2015

Foreign Languages on the Web – The “Lang” attribute and Classic Latin.

Word cloud featuring multiple foreign languages

Prologue

In July of 2006, I took a position at Stanford University to establish and run an on-line accessibility program, which ultimately became known as SOAP (Stanford Online Accessibility Program – finding the name was almost too easy). While there, I published a number of ‘whitepapers’ and other thought pieces that were removed from the SOAP site after my departure in 2012. Since there does not appear to be an archived copy of this paper at Stanford today, I am re-posting it in the academic spirit of knowledge sharing.

By: John Foliot | Posted: April 24, 2008

Recently I was faced with a question on which I had no hard data. Fortunately, within the circle of international colleagues I work with who specialize in “web accessibility”, I was able to gather some interesting information and opinion, and I share these thoughts and observations here.

Update

After the original posting of this White Paper, it was noted that the example shown was both Latin (and more specifically Medieval Latin) and Middle English. As such the correct markup would be: <span lang="la">Incipit:</span> <span lang="enm">Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>. Obviously this will further impact the decision process of the project in question, as I am most certain that there are currently no speech synthesizers for Middle English, although the other reasons given for using the “lang” attribute remain valid. As well, while Medieval Latin and Classical Latin can differ, there is currently only one ISO code for Latin, being “la”.

The Problem:

A web page (or specifically a series of web pages) written in English also features extensive tracts of Classical Latin text – text originating from the 12th and 13th centuries. The W3C WCAG1 guidance states: “Clearly identify changes in the natural language of a document’s text and any text equivalents (e.g., captions).” (Priority 1, Checkpoint 4.1)

The question, however, was whether the non-trivial task of marking up these Latin texts to meet the WCAG requirement would be worth the investment. (<span lang="la">Incipit: Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>)

There is the possibility that doing so would still not satisfy a key constituency, screen reader users, in a practical way, as it was not clear that a Latin speech synthesizer even existed. As well, “Screen readers without Unicode support will read a character outside Latin-1 as a question mark, and even in the latest version of JAWS, the most popular screen reader, Unicode characters are very difficult to read.” [1]

Opinions and Facts:

The facts and opinions prompted by the question centered on the following points:

Screen Readers / Screen Reading Software
The number of languages supported by JAWS (the leading screen reading software package in the marketplace) is not limited to the list at the software vendor’s website [2], as local distributors can deliver additional versions; Freedom Scientific Benelux, for example, offers a JAWS version with a speech synthesizer for Dutch. It could not be determined, however, whether a JAWS speech synthesizer for Latin currently exists. [3] There is, though, a sizable cottage industry of JAWS scripters who could add support for the characters even if it is not currently available. Given that this is an academic project, it is conceivable that a blind researcher may wish to tap into this scripting resource to add the capability, if documents exist that would become more accessible with the investment.

Even if there were no speech synthesis available for a language, screen readers like JAWS can announce language changes and users can associate particular voice configurations with particular languages.

Looking beyond JAWS, Classical Latin [4] is among the current MBROLA voices [5] available. It is therefore (at least theoretically) usable with at least some screen readers and text-to-speech software, e.g. NVDA [6], FreeTTS (used by FireVox) [7], and Emacspeak [8].

Typesetting / Alternate Usage
As this is an academic project, it might be more important to mark up the language correctly for reasons other than accessibility. It is possible to machine-process words or even phrases in various useful ways, e.g. for machine translation. Such processing is significantly more successful if you know for sure what language you are dealing with.

For example, if a user opened your HTML page in a word processor such as Microsoft Word, it would use the language markup, and this can be relevant when spelling checks are “on”, i.e. when words classified as misspelled are highlighted. Declaring Latin words as Latin prevents the program from applying English spelling rules to them. (A copy of Word tested for this seemed to be Latin-ignorant. That is, it recognized the words as being in Latin but did not flag anything as misspelled and did not even hyphenate Latin words. This is probably better than treating them as English or some other language.)
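As an illustration of the kind of machine processing that language markup makes possible, here is a minimal sketch in Python. It uses only the standard library's html.parser module; the class name LangRunCollector, the helper collect_lang_runs, and the default language "en" are assumptions made for this example and are not part of the project discussed here. It groups text by the language declared in the markup, with nested elements inheriting the language of their parent, so that each run could be handed to a language-appropriate tool (spell checker, translator, speech synthesizer).

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not be pushed on the language stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LangRunCollector(HTMLParser):
    """Collect (language, text) runs; nested elements inherit their parent's lang."""

    def __init__(self, default_lang="en"):
        super().__init__(convert_charrefs=True)
        self._stack = [default_lang]  # effective language of each open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang attribute inherits the current language.
        self._stack.append(lang if lang else self._stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self._stack[-1], text))


def collect_lang_runs(html, default_lang="en"):
    """Hypothetical helper: return a list of (language, text) pairs for a fragment."""
    parser = LangRunCollector(default_lang)
    parser.feed(html)
    parser.close()
    return parser.runs


SAMPLE = (
    '<p>The text opens: <span lang="la">Incipit:</span> '
    '<span lang="enm">Holi writ haþ a liknesse to tre</span></p>'
)
runs = collect_lang_runs(SAMPLE)
```

With the sample fragment, the runs come back tagged "en", "la" and "enm" in document order. The sketch assumes reasonably well-formed markup; a production tool would need a more forgiving parser.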

Even when the language markup is correct, however, search engines (such as Google) and related tools do not necessarily use this information today. One respondent found web pages in Dutch, with correct language markup, that still showed up in search results even when he explicitly asked Google to return only pages in English.

Character support is a different issue: it should not depend on language markup, and mostly doesn’t. Generally, in special software like screen readers or specialized browsers, we should expect character support to be more restricted than in common modern browsers. Even Latin-1 isn’t as safe as in “normal” browsing. For example, what would a screen reader do upon encountering a special character like “¶”? Would it recognize it as having a special meaning (paragraph separator) and pause? It probably spells it out. This might mean saying “pilcrow sign”, perhaps independently of the language being used (since character names aren’t widely localized – most characters don’t even have a name in most languages), which might be complete gibberish even to people who understand normal English.
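To make the character question concrete, the following minimal Python sketch lists the non-ASCII characters in a passage, reports whether each falls inside Latin-1 (code points up to U+00FF), and gives its formal Unicode name, which is the sort of label a tool might read aloud. It uses the standard unicodedata module; the function name describe_characters and the sample string are assumptions for this example, and no actual screen reader behavior is modeled.

```python
import unicodedata


def describe_characters(text):
    """Return (char, in_latin1, unicode_name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) < 128 or ch in seen:
            continue  # skip plain ASCII and characters already reported
        seen.append(ch)
    return [(ch, ord(ch) <= 0xFF, unicodedata.name(ch, "UNNAMED")) for ch in seen]


# Thorn and pilcrow are inside Latin-1; the oe ligature is not.
report = describe_characters("Holi writ ha\u00fe a liknesse \u00b6 \u0153")
for ch, in_latin1, name in report:
    where = "Latin-1" if in_latin1 else "outside Latin-1"
    print(f"U+{ord(ch):04X} {where}: {name}")
```

Note that the thorn (þ) in the Middle English sample is itself a Latin-1 character, so a check like this could help a project decide which passages are actually at risk in older software.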

Current and Future Technology
Style sheets, either page or user style sheets, could be used to style words in a particular language differently from others, using a selector like [lang="la"] or :lang(la). However, this does not work in all browsers; IE 6, for example, does not recognize such selectors. On some browsers, like Firefox, the user can right-click on a word and get information about its language*. Finally, some day some browsers or other software could make real use of the markup.

(* Firefox users can test this directly from this page: one of the contributors to this document works at Katholieke Universiteit Leuven – place your mouse over this name, right click and choose ‘Properties’)

Conclusion

While current support for foreign languages such as Latin remains minimal in 2008, there do exist at least some compelling reasons to consider marking up existing content using the “lang” attribute. Beyond strict conformance to a W3C WCAG Priority 1 requirement, future-proofing the content and enhancing its usability suggest that the return on investment can be justified. It remains, however, the decision of the content owner to make the final call.

A special note of thanks goes out to the following contributors, who provided much of this information, and have been quoted (often verbatim) in this white paper:


  1. http://en.wikipedia.org/wiki/Wikipedia:Accessibility
  2. http://www.freedomscientific.com/fs_products/software_jawsinfo.asp
  3. http://lists.w3.org/Archives/Public/w3c-wai-gl/2005AprJun/0097.html
  4. http://tcts.fpms.ac.be/synthesis/mbrola/demo/la1.wav – NOTE: male voice, 188K Wav file
  5. http://tcts.fpms.ac.be/synthesis/mbrola.html
  6. http://www.nvda.fr/spip.php?article14
  7. http://freetts.sourceforge.net/
  8. http://web.mit.edu/ATIC/src/emacspeak-9.0/mbrola

CC BY-NC-SA 4.0 Foreign Languages on the Web – The “Lang” attribute and Classic Latin. by John Foliot is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Posted by John

I am a 16-year veteran of Web Accessibility, living and working in Austin, Texas. Currently Principal Accessibility Strategist at Deque Systems Inc., I have previously held accessibility-related positions at JPMorgan Chase and Stanford University. I am also actively involved with the W3C, the international internet standards body, where I attempt to stir the pot, fight hard for accessibility on the web, and am currently co-chairing a subcommittee on the accessibility of media elements in HTML5.
