Thursday, May 25, 2006

Character encoding, character references and Windows-1252

Consider this XML:

<?xml version="1.0" encoding="Windows-1252" ?>
<foo>&_#150;0x96</foo>

Here we have the character reference #150 and the actual character 0x96 that the reference resolves to (150 decimal is 0x96 in hex). The underscore is used to prevent the character reference being resolved by the browser, and the string '0x96' stands in for the actual character 0x96 because Blogger complains otherwise (stick with me...). Note the encoding declared in the prolog.

Transforming that XML with this stylesheet:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="US-ASCII"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>

Gives this result:

<foo>&_#150;&_#8211;</foo>

Here you can see the character reference has been written back out as the same character, but character 0x96 has become 0x2013 (8211 in decimal). Character 150 is the Unicode character "START OF GUARDED AREA", one of the non-displayed C1 control characters, but in the Windows-1252 encoding that byte is mapped to the displayable character 0x2013, "EN DASH" (a short dash). Microsoft squeezed more characters into the single-byte range by replacing non-displayed control characters with more useful displayable ones, but some MS Office applications then mistakenly labelled files encoded this way as ISO-8859-1. In ISO-8859-1 the characters in the C0 and C1 ranges are non-displayable control characters*, but this mis-labelling was so widespread that parsers began detecting the situation and silently switching the read encoding to Windows-1252.
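
The mapping is easy to see outside XML altogether. Here is a minimal Python sketch (purely illustrative, not part of the original toolchain) decoding the same byte under both encodings:

# Byte 0x96 decoded as Windows-1252 is U+2013 (EN DASH),
# but decoded as ISO-8859-1 it is the invisible C1 control U+0096.
raw = b"\x96"
print(hex(ord(raw.decode("windows-1252"))))  # 0x2013
print(hex(ord(raw.decode("iso-8859-1"))))    # 0x96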

This problem surfaces when serving XHTML to an XHTML-capable browser. As long as a browser reads a file with its HTML parser, a file mis-labelled as ISO-8859-1 that contains characters in the C0 or C1 ranges will still auto-magically display them, because the forgiving parser auto-switches the read encoding. However, when an XHTML file is served with the correct MIME type (e.g. "application/xhtml+xml"), an XHTML browser such as Firefox parses it with its stricter XML parser, and all the characters in the C0 and C1 ranges remain as non-displayed control characters. The auto-switch doesn't take place, and characters such as 0x96 (the Windows-1252 en dash) that were once displayed simply disappear.

This problem only occurs when an XML file is saved as Windows-1252 but labelled as something else, usually ISO-8859-1. The most common culprit is Notepad, where a user has edited and saved an XML file without realising (or caring) that Notepad pays no attention to the XML prolog.
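
To make the Notepad scenario concrete, here's a small Python sketch (again just an illustration, using ElementTree rather than a browser): the bytes are Windows-1252 but the prolog claims ISO-8859-1, so a strict XML parser dutifully reads byte 0x96 as the invisible control character rather than an en dash:

import xml.etree.ElementTree as ET

# A file whose bytes are Windows-1252 (the \x96 byte is the en dash
# as typed in Notepad) but whose prolog claims ISO-8859-1.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><foo>\x96</foo>'

root = ET.fromstring(doc)
print(hex(ord(root.text)))  # 0x96 - the non-displayed C1 control character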

So back to the example above:

<?xml version="1.0" encoding="Windows-1252" ?>
<foo>&_#150;0x96</foo>

The main point to realise here is that character references (the #150 in the example) are always resolved as Unicode codepoints, regardless of the specified encoding. Actual characters in the file, on the other hand, are read using the specified encoding. Therefore the #150 resolves to 0x96 (its Unicode codepoint, the C1 control character), while the actual character 0x96 in the source becomes 0x2013 (#8211), as the Windows-1252 encoding specifies.
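
This is easy to verify with another quick Python sketch (illustrative only; Python's expat-based parser resolves the declared encoding through its own codecs). Parsing the example document, with the underscore removed and a real 0x96 byte in place, shows the two codepoints that come out:

import xml.etree.ElementTree as ET

# The example document: a real &#150; reference plus a real 0x96 byte,
# declared (correctly this time) as Windows-1252.
doc = b'<?xml version="1.0" encoding="Windows-1252"?><foo>&#150;\x96</foo>'

root = ET.fromstring(doc)
print([hex(ord(c)) for c in root.text])  # ['0x96', '0x2013']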

The result of the transformation, serialised using the US-ASCII encoding (so all characters above 127 are written out as character references), demonstrates this:

<foo>&_#150;&_#8211;</foo>
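
The serialisation step can be sketched the same way: encoding those two characters to US-ASCII with XML character-reference replacement reproduces the result above.

# Neither U+0096 nor U+2013 is representable in US-ASCII, so the
# serialiser writes both out as decimal character references.
text = "\x96\u2013"
print(text.encode("us-ascii", "xmlcharrefreplace"))  # b'&#150;&#8211;'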

A great example I think :)

Wikipedia has lots more information: http://en.wikipedia.org/wiki/ISO_8859-1

*There are two versions of ISO-8859-1 - the ISO and IANA versions. The ISO version doesn't contain the C0 and C1 control characters; the IANA version does. The XML recommendation uses the IANA version.
