Archive for May 2nd, 2008

Unicode: The absolute minimum every developer should know.

Over the past few years, I’ve been engaged in a few projects which have required international support. Via trial by fire, I’ve learned a fair amount about Unicode and character encoding. I now consider this to be essential knowledge for all programmers – web programmers especially. This becomes particularly important when you start dealing with strict formats such as XML, which care greatly what the individual characters are, and where a single invalidly encoded character can make a document fail to parse or validate.

There’s a great article with the absurdly long yet entirely accurate title of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky of Joel on Software fame.

It’s important to realize that Unicode is just a really big alphabet. It’s a collection of letters, glyphs, and symbols from a lot of different languages and some non-languages (eg, there’s a smiley face and a snowman). Even fictional scripts have been proposed: Klingon was submitted for inclusion, though the Unicode Consortium rejected it. Unicode on its own specifies nothing about what bytes you use to represent a given character; it just assigns each character a number, called a code point, and says that code point 12345 should be such and such a symbol.
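To make the "character number" idea concrete, here is a minimal Python sketch. The helper name `describe` is my own invention for illustration; `ord`, `chr`, and `unicodedata.name` are from the standard library. Note that nothing here involves bytes yet, only numbers and the symbols they identify.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# ord() gives the code point of a character; chr() goes the other way.
print(describe("A"))         # U+0041 LATIN CAPITAL LETTER A
print(describe("\u263A"))    # U+263A WHITE SMILING FACE

# Code point 12345 (decimal) is just another entry in the big alphabet.
print(describe(chr(12345)))
```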

Next comes encoding. An encoding defines how we map a given Unicode character into bytes. There are many encoding schemes out there, and some of them favor one language or another (ISO-8859-1 and its close relative Windows-1252, which is common in Microsoft applications, favor Western European languages).
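As a rough illustration of the mapping from character to bytes, the sketch below encodes one and the same character under three encodings. The helper `encode_hex` is a made-up name for this example; `str.encode` and `bytes.hex` are standard Python.

```python
def encode_hex(text: str, codec: str) -> str:
    """Encode text with the given codec and return the bytes as hex."""
    return text.encode(codec).hex()


e_acute = "\u00E9"  # é, LATIN SMALL LETTER E WITH ACUTE

# One character, three different byte sequences.
print(encode_hex(e_acute, "iso-8859-1"))  # e9
print(encode_hex(e_acute, "utf-8"))       # c3a9
print(encode_hex(e_acute, "utf-16-le"))   # e900
```

The character never changes; only the bytes chosen to represent it do.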

Not all encodings are able to represent the entire Unicode character set. The characters which can be represented by a given encoding are known as that encoding’s character repertoire. The encoding specifies both how to represent characters in terms of bytes on a disk, and also which subset of the Unicode character set it can represent. The same bytes interpreted under a different character encoding will yield different characters.
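Both points above can be sketched in a few lines of Python. The function `can_encode` is a hypothetical helper of mine; it relies on the standard `UnicodeEncodeError` that Python raises when a character falls outside a codec's repertoire.

```python
def can_encode(text: str, codec: str) -> bool:
    """Report whether every character in text is in the codec's repertoire."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20AC"  # €, which ISO-8859-1 has no byte for
print(can_encode(euro, "iso-8859-1"))  # False
print(can_encode(euro, "utf-8"))       # True

# The same two bytes read under two different encodings.
raw = "\u00E9".encode("utf-8")         # b'\xc3\xa9'
print(raw.decode("utf-8"))             # é
print(raw.decode("iso-8859-1"))        # Ã©
```

That last line is the classic garbled-text symptom: the bytes are fine, but they are being read with the wrong encoding.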

Some encodings specify that each character is made up of some pre-determined, fixed number of bytes, while others specify that different characters are made up of different numbers of bytes – usually the most common characters are a single byte, and less common characters are multiple bytes.
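Here is a small sketch of the difference, using UTF-32 as the fixed-width example and UTF-8 as the variable-width one (my choice of encodings, not something the paragraph above names). The helper `byte_lengths` is hypothetical.

```python
def byte_lengths(text: str, codec: str) -> list[int]:
    """Return how many bytes each character of text takes in the codec."""
    return [len(ch.encode(codec)) for ch in text]


# A, é, € and an emoji: progressively "less common" characters.
sample = "A\u00E9\u20AC\U0001F600"

print(byte_lengths(sample, "utf-32-le"))  # [4, 4, 4, 4]
print(byte_lengths(sample, "utf-8"))      # [1, 2, 3, 4]
```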

One of the most common encodings is UTF-8. This is an incredibly popular character encoding, and for good reason – it can represent every Unicode character, and it uses only a single byte for ASCII characters (the most common ones in English text and in markup), making it also one of the most compact for that kind of content. I’ll talk about it some more tomorrow.
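For a rough feel of the compactness claim, this sketch compares encoded sizes. The sample strings and the helper `encoded_size` are my own; the comparison with UTF-16 and UTF-32 is an assumption about which alternatives are worth measuring against.

```python
def encoded_size(text: str, codec: str) -> int:
    """Return the number of bytes text occupies in the given codec."""
    return len(text.encode(codec))


english = "Hello, world"
print(encoded_size(english, "utf-8"))      # 12
print(encoded_size(english, "utf-16-le"))  # 24
print(encoded_size(english, "utf-32-le"))  # 48

# Pure ASCII text is byte-for-byte identical in UTF-8.
print(english.encode("utf-8") == english.encode("ascii"))  # True

# The advantage depends on the text: these three characters take
# three bytes each in UTF-8 but two each in UTF-16.
japanese = "\u65E5\u672C\u8A9E"  # 日本語
print(encoded_size(japanese, "utf-8"))      # 9
print(encoded_size(japanese, "utf-16-le"))  # 6
```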
