Unicode: The absolute minimum every developer should know.


Over the past few years, I’ve been engaged in a few projects which have required international support. Through trial and error, I’ve learned a fair amount about Unicode and character encoding, and I now consider this essential knowledge for all programmers – web programmers especially. It becomes particularly important when you start dealing with strict formats such as XML, which care greatly about what the individual characters are, and which can be broken outright by invalidly encoded characters.

There’s a great article here with the absurdly long yet entirely accurate title of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky of Joel on Software fame.

It’s important to realize that Unicode is just a really big alphabet. It’s a collection of letters, glyphs, and symbols from a lot of different languages, plus plenty of non-language symbols (arrows, dingbats, even a few smiley faces), and people have even proposed adding constructed languages like Klingon (that one was rejected, though). It specifies nothing about what bytes you use to represent a given Unicode character; it just says that character number 12345 should be such-and-such a symbol.
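For example, in Python (used here purely as an illustration – the idea is the same in any language), every character maps to a numeric code point:

    # Every Unicode character is identified by a numeric "code point",
    # quite apart from any particular byte representation.
    print(ord("A"))        # 65
    print(ord("€"))        # 8364 (U+20AC, the euro sign)
    print(chr(0x263A))     # ☺  (U+263A, "white smiling face")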

Next comes encoding. An encoding defines how we map a given Unicode character into bytes. There are many encoding schemes out there, and some of them favor one language or another (ISO-8859-1, which is common in Microsoft applications, favors Western European languages).

Not all encodings are able to represent the entire Unicode character set. The set of characters that a given encoding can represent is known as its character repertoire. An encoding therefore specifies two things: how to represent characters as bytes on disk (or on the wire), and which subset of the Unicode character set it covers. The same bytes interpreted under a different character encoding will yield different glyphs.
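A small Python sketch of both points (my own illustration): the euro sign isn’t in ISO-8859-1’s repertoire at all, and the same two bytes decoded under two different encodings produce different characters.

    # "€" is simply outside ISO-8859-1's repertoire.
    try:
        "€".encode("iso-8859-1")
    except UnicodeEncodeError as err:
        print("Not in the repertoire:", err)

    # The same bytes mean different things under different encodings.
    data = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
    print(data.decode("utf-8"))          # é
    print(data.decode("iso-8859-1"))     # Ã©  (the classic mangled output)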

Some encodings are fixed-width: every character takes the same pre-determined number of bytes. Others are variable-width: different characters take different numbers of bytes – usually the most common characters take a single byte, and less common characters take several.
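To see the difference (again, just an illustrative Python snippet): UTF-8 is variable-width, while UTF-32 always spends four bytes per character.

    for ch in ("A", "é", "€", "中"):
        print(ch,
              len(ch.encode("utf-8")), "byte(s) in UTF-8 vs",
              len(ch.encode("utf-32-be")), "bytes in UTF-32")
    # UTF-8 uses 1, 2, 3, and 3 bytes respectively; UTF-32 always uses 4.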

One of the most common encodings is UTF-8, and for good reason – it can represent the entire Unicode repertoire, yet uses only a single byte for the most common (ASCII) characters, making it one of the most compact choices for typical text. I’ll talk about it some more tomorrow.
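One handy property worth demonstrating (my example, not from the original post): for plain ASCII text, UTF-8 output is byte-for-byte identical to ASCII.

    text = "Hello, world"
    print(text.encode("utf-8") == text.encode("ascii"))   # True – identical bytes
    print(len(text.encode("utf-8")))                      # 12, one byte per character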

  1. #1 by Brian at May 3rd, 2008

    It is one thing to know Unicode and another thing to know how to use it correctly in a web app.

    Detecting language settings and displaying the right language would make for a very interesting topic.

  2. #2 by Ben Nadel at May 7th, 2008

    Good stuff. Thanks for the very simplified description. I get confused because there is encoding in ColdFusion, right? But then also in the browser as well? I never set any of it and it seems to work. I guess it’s all UTF-8 by chance or something?

  3. #3 by eric stevens at May 7th, 2008

    There’s always encoding of some form. Sometimes the encoding might only be “ASCII,” which only supports the first 128 characters (anything funky, like a trademark symbol, an em dash, etc., gets dropped or misrepresented). ASCII is a fixed-width, one-byte-per-character encoding.

    Your browser will automatically encode whatever it sends to the server in whatever encoding it prefers (in the US that might be ISO-8859-1). The web server decodes this into Unicode, and it decodes (or at least attempts to decode) all the other data you’re working with into Unicode too.

    Once its processing is done, it encodes the output into some character encoding for display on the web page. If you look in your HTML head tag, there’s typically a meta tag which specifies the encoding. Hopefully your server and that tag agree on the encoding.

    ColdFusion by default outputs as UTF-8, so if your HTML meta tag doesn’t agree with this, you should change your meta tag to match CF.

    So everything gets converted to Unicode for in-memory processing, no matter its original encoding. If you see funky characters, it’s probably because something was interpreted using the wrong encoding. Then, for output, it all gets encoded as something else again (in CF’s case, UTF-8).
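    To make that round trip concrete, here it is sketched in Python rather than ColdFusion – purely illustrative, and it assumes the browser happened to send its form data as ISO-8859-1:

        # 1. The browser sends bytes in whatever encoding it chose.
        raw_request = "name=José".encode("iso-8859-1")

        # 2. The server decodes those bytes into Unicode text; guess the wrong
        #    encoding at this step and you get the funky characters.
        text = raw_request.decode("iso-8859-1")
        text = text.upper()                  # all in-memory work happens on Unicode strings

        # 3. The output is encoded again – UTF-8 here – to match the declared charset.
        response_body = text.encode("utf-8")
        content_type = "text/html; charset=utf-8"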

  4. #4 by Ben Nadel at May 8th, 2008

    @Eric,

    Thanks for the explanation. That makes sense. Unfortunately, I don’t have meta tags in my head for encoding :( I guess that is something that I should start using. It works for me and my computer, but I guess I have to make it work for other people as well.

  5. #5 by eric stevens at May 8th, 2008

    Alternatively you can specify
    <cfcontent type="text/html;charset=utf-8">

    With HTML in particular there are several ways to indicate encoding.

  6. #6 by Ben Nadel at May 8th, 2008

    @Eric,

    Thanks. That seems pretty easy.
