Nov 29, 2006

When good encodings go bad

In the past year, I've been doing some work with Unicode as part of the PHP6 upgrade. I've learned more than I wanted to know about all sorts of encodings, from UTF-7 to KOI8-R to good old ISO-8859-1. I've picked apart the picayune differences between UCS-2 and UTF-16, and played the game of surrogate pairing and orphaning. Despite all that exposure, however, I wasn't prepared when a question about a lesser-known encoding called AL32UTF8 crossed my inbox.

I'd never heard of this one before, so I went to my favorite search engine for some answers. Turns out it's Oracle terminology: AL32UTF8 is Oracle's name for honest-to-goodness UTF-8, while its older "UTF8" character set uses a scheme of Oracle's own devising that the Unicode Consortium later wrote up under the name CESU-8. At first glance, CESU-8 looks identical to UTF-8 in the same way that UCS-2 looks a lot like UTF-16. In fact, every code point from U+0000 to U+FFFF is encoded identically under both sets of rules: up to 16 bits of code point data, split over one, two, or three bytes, with the leftover bits spent on the encoding's framing.
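
To see that in practice, here's a quick Python sketch (my own illustration, not anything from Oracle's documentation) of how BMP code points get packed. Up to U+FFFF, what comes out is the same whether you call it UTF-8 or CESU-8:

    # For code points up to U+FFFF, UTF-8 and CESU-8 agree byte for byte:
    # the code point's bits are packed into 1, 2, or 3 bytes depending on
    # how many of them it actually needs.
    for ch in "A", "é", "中":
        print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")

    # U+0041 -> 41         (1 byte:  7 bits of payload)
    # U+00E9 -> c3 a9      (2 bytes: up to 11 bits)
    # U+4E2D -> e4 b8 ad   (3 bytes: up to 16 bits)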

When you jump up above U+FFFF, however, into the realm of supplementary code points like the CJK extensions (including my personal nom de pointe: 𣚺), something funny starts to happen. In the UTF-8 world, these code points are accommodated by adding one extra byte to the mix, which allows for 21 bits of data (enough to cover all of Unicode, which tops out at U+10FFFF). In the CESU-8 world, however, the code point is first split according to UTF-16 surrogate-pair rules into two surrogate code units (each in the range U+D800 - U+DFFF), and each of those is then encoded individually as its own three-byte UTF-8-style sequence. This means we've now promoted our variable-length (four bytes max) multibyte encoding to a variable-length (six bytes max) multibyte-multibyte encoding. Thank you, Oracle. Thank you for adding complexity to the encoding rules while increasing the data storage requirements. What would the world do without you?
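
If you want to see the damage for yourself, here's a toy CESU-8 encoder in Python. It's just my own sketch of the rules described above (nothing lifted from Oracle or the CESU-8 report), and the sample character is an arbitrary supplementary code point, not the one from my example:

    def cesu8_encode(text: str) -> bytes:
        """Toy CESU-8 encoder: BMP code points follow normal UTF-8 rules,
        while anything above U+FFFF is first split into a UTF-16 surrogate
        pair and each surrogate gets its own three-byte sequence."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp <= 0xFFFF:
                out += ch.encode("utf-8")        # identical to UTF-8 here
            else:
                cp -= 0x10000                    # UTF-16 surrogate split
                for s in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                    out += bytes([0xE0 | (s >> 12),           # three-byte form
                                  0x80 | ((s >> 6) & 0x3F),
                                  0x80 | (s & 0x3F)])
        return bytes(out)

    clef = "\U0001D11E"                          # U+1D11E, above the BMP
    print(clef.encode("utf-8").hex(" "))         # f0 9d 84 9e       (4 bytes)
    print(cesu8_encode(clef).hex(" "))           # ed a0 b4 ed b4 9e (6 bytes)

Six bytes where four would do, and any decoder now has to reassemble surrogate pairs on top of the usual bit shuffling.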

P.S. - Java is at fault too... its 'Modified UTF-8' uses nearly identical rules.
