Nov 29, 2006

When good encodings go bad

In the past year, I've been doing some work with Unicode as part of the PHP6 upgrade. I've learned more than I wanted to know about all sorts of encodings from UTF-7 to KOI8-R to good old ISO-8859-1. I've picked apart the picayune differences between UCS-2 and UTF-16, and played the game of surrogate pairing and orphaning. Despite all that exposure, however, I wasn't prepared when a question crossed my inbox about a lesser-known encoding called AL32UTF8.

I'd never heard of this one before, so I went to my favorite search engine for some answers. Turns out it's something Oracle came up with that later got written up by the Unicode Consortium (as Technical Report #26) under the name CESU-8. At first glance, CESU-8 looks identical to UTF-8 in the same way that UCS-2 looks a lot like UTF-16. In fact, every codepoint from U+0000 to U+FFFF is encoded identically under both sets of rules: up to 16 bits, split up over one, two, or three bytes, with the leftover bits framing the encoding protocol.
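
For the curious, here's roughly what that bit-splitting looks like for the U+0000 to U+FFFF range. This is just a quick sketch in Python; `utf8_encode_bmp` is a name I made up for illustration, not anything from a real library, and it makes no attempt to reject surrogates or otherwise validate its input.

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode one code point in U+0000..U+FFFF as UTF-8 (hypothetical helper)."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles U+0000..U+FFFF")
    if cp < 0x80:
        # 0xxxxxxx: 7 data bits, one byte
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx: 11 data bits, two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: 16 data bits, three bytes
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Sanity check against Python's own encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x20AC):
    assert utf8_encode_bmp(cp) == chr(cp).encode("utf-8")
```

The `110`, `1110`, and `10` prefixes are the "leftover bits" doing the framing; everything else is payload.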

When you jump up above U+FFFF, however, into the realm of CJK codepoints and the like (such as my personal nom du pointe: 𣚺), something funny starts to happen. In the UTF-8 world, these codepoints are accommodated by adding one extra byte to the mix, which allows for up to 21 bits of data (exactly what all of Unicode requires). In the CESU-8 world, however, the code point is split according to UTF-16 surrogacy rules, making two separate Unicode points (each in the range U+D800 - U+DFFF). These two Unicode points are then encoded individually into three-byte UTF-8 sequences. This means that we've now promoted our variable length (4 max) multibyte encoding to a variable length (6 max) multibyte-multibyte encoding. Thank you, Oracle. Thank you for adding complexity to encoding rules while increasing data storage requirements. What would the world do without you?
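
To make the four-versus-six byte difference concrete, here's a minimal sketch of a CESU-8 encoder in Python. `cesu8_encode` is a hypothetical helper of my own (Python doesn't ship a CESU-8 codec as far as I know); it leans on the standard `surrogatepass` error handler to let the built-in UTF-8 encoder emit the surrogate halves, and it does no input validation.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, sketch only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # Same bytes as UTF-8 for anything up to U+FFFF.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a UTF-16 surrogate pair first...
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # top 10 bits
            low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits
            # ...then encode each half as its own three-byte sequence.
            for half in (high, low):
                out += chr(half).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U0001F600"  # any code point above U+FFFF will do
print(sample.encode("utf-8").hex())  # 4 bytes
print(cesu8_encode(sample).hex())    # 6 bytes
```

For U+1F600 that comes out as `f09f9880` under UTF-8 and `eda0bdedb880` under CESU-8, so the same character costs two extra bytes and a strict UTF-8 decoder will reject the result.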

P.S. - Java is at fault too... its 'Modified UTF-8' uses nearly identical rules.

Nov 5, 2006

Don't worry, I slept last Thursday

On this, my first foray to Europe, indeed my first real trip outside the US (those couple hours in Tijuana don't count), I'm faced with one undeniable, inescapable fact. Jet lag sucks.

It doesn't help that all last week I was staying up past my normal bedtime partying with the attendees of ZendCon06, but I think my real mistake was trying to outsmart my own circadian rhythm. See, I figured "I've got this long flight across the Atlantic, it'll go faster if I can fall asleep at some point." Seems reasonable so far. How to ensure sleep? Why, stay up all night before the flight. Brilliant! But wait, what if I can't fall asleep during the flight?

Sometime Sunday morning, my plane lands in Frankfurt and I wander, zombie-like, through passport control, baggage claim, and customs, somehow managing to board the right shuttle to reach the conference hotel. Based on advice from battle-hardened globetrotters, I was planning to put in a one-hour power nap, then go for a walk to "reset" my internal clock. Unfortunately, the sixty-plus hour run of consciousness had other plans, and by the time I awoke, the sun had set.

Finally, this morning (Monday), I managed to put in that walk, touring a nearby suburb which reminded me somewhat spookily of the setting from "Shaun of the Dead" (more zombie tie-ins). Five euros and three liter bottles of Diet Coke later and I'm almost coherent enough to... what the hell was I gonna say? Sorry, my brain has been cutting out a lot the past few days...

Guten Morgen!