How UTF-8 Encoding Works


I spoke yesterday about Unicode, and about the difference between the Unicode character set and specific encodings of that character set. This post is a follow-up which describes in detail one particular and popular encoding of that character set - UTF-8.

It’s important to understand that an encoding is simply a means of representing a Unicode character in terms of bytes on a disk or in a network stream. In the same sense, the character λ can be encoded in HTML as any of &lambda;, &#955;, or &#x3bb;. These HTML entities are another form of character encoding.

To understand how UTF-8 encoding works, you have to be really clear in your head that a byte and a character are different things. Just like encoding special characters in HTML, a character in UTF-8 can be made up of one or more bytes. Such encoded characters are called multi-byte characters, because they require 16, 24, or 32 bits (or, rarely, more) to represent a single character, corresponding of course to 2, 3, and 4 bytes.
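
As a quick illustration (a Python 3 sketch of my own, not from the original post), the single character λ is one character but two bytes once encoded:

# One character, but two bytes once encoded as UTF-8
text = "λ"
print(len(text))                  # 1 character
print(len(text.encode("utf-8")))  # 2 bytes
print(text.encode("utf-8"))       # b'\xce\xbb'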

Because different characters are made up of different numbers of bytes, UTF-8 is called a variable-width encoding. Variable-width encodings are popular because they can reduce the storage size of the most common characters. For example, this page is encoded in UTF-8, and yet almost every character in its source code requires only a single byte.
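
To make the storage claim concrete (again a Python 3 sketch; the sample strings are my own): an English string needs one byte per character, while a Greek string needs two bytes per character.

ascii_text = "hello world"   # 11 characters
greek_text = "αβγδε"         # 5 characters
print(len(ascii_text.encode("utf-8")))  # 11 bytes - one byte per character
print(len(greek_text.encode("utf-8")))  # 10 bytes - two bytes per character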

The UTF-8 encoding is popular because, especially for speakers of Latin-alphabet languages (Europe, the USA, et al.), it requires only one byte per character for the most common printed characters. Also, and unlike some other character encodings, existing text encoded as ASCII (the old IBM PC days of 7 bits per character - the first 128 characters in an ASCII table) translates into UTF-8 without any need for conversion. UTF-8 also supports a very large range of characters (its character repertoire).
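
A short way to see the ASCII compatibility for yourself (a Python 3 sketch): encoding the same text as ASCII and as UTF-8 produces identical bytes, so an existing ASCII file is already valid UTF-8.

text = "plain old ASCII text"
print(text.encode("ascii") == text.encode("utf-8"))  # True - byte-for-byte identical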

To understand the way UTF-8 works, we have to examine the binary representation of each byte. If the first bit (the high-order bit) is zero, then it’s a single-byte character, and we can directly map its remaining bits to the Unicode characters 0-127. If the first bit is a one, then this byte is a member of a multi-byte character (either the first byte of it or some follow-up byte).
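
In code, that first test is just a check of the high-order bit (a hypothetical helper of my own for illustration):

def is_single_byte(byte):
    # A high-order bit of 0 means a standalone character in the range 0-127
    return byte & 0b10000000 == 0

print(is_single_byte(0x41))  # True  - 'A' is 01000001
print(is_single_byte(0xCE))  # False - part of a multi-byte character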

For a multi-byte character (any character whose Unicode number is 128 or above), we need to know how many bytes will make up this character. This is stored in the leading bits of the first byte in the character. We can identify how many total bytes will make up this character by counting the number of leading 1’s before we encounter the first 0. Thus, for the first byte in a multi-byte character, 110xxxxx represents a two-byte character, 1110xxxx represents a three-byte character, and so on.
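
Counting those leading 1s is easy to sketch (again a hypothetical helper of my own, which assumes it is given the first byte of a character):

def sequence_length(first_byte):
    # 0xxxxxxx -> 1 byte; 110xxxxx -> 2; 1110xxxx -> 3; 11110xxx -> 4
    if first_byte & 0b10000000 == 0:
        return 1
    count = 0
    mask = 0b10000000
    while first_byte & mask:
        count += 1
        mask >>= 1
    return count

print(sequence_length(0x41))  # 1 - a plain ASCII byte
print(sequence_length(0xCE))  # 2 - first byte of a two-byte character such as λ
print(sequence_length(0xE2))  # 3 - first byte of a three-byte character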

For secondary bytes in a multi-byte UTF-8 encoded character, the first two bits are reserved to indicate that this is a follow-up byte. All second and subsequent bytes are of the form 10xxxxxx. It may seem wasteful to sacrifice two bits of each follow-up byte, since by the time we reach this byte we already know that it’s part of a multi-byte character. Indeed, it is wasteful, but it is intentional, and intended to cope with systems which do not properly handle multi-byte character encodings. For example, it enables us to identify when a string was encoded with an alternate character encoding but is being parsed as UTF-8. It also enables us to identify strings which were incorrectly truncated.
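
That 10xxxxxx marker is what lets a decoder notice a string chopped off mid-character. Here is a Python 3 sketch of the idea (the helper is my own simplification):

def is_continuation(byte):
    # Follow-up bytes always look like 10xxxxxx
    return byte & 0b11000000 == 0b10000000

data = "λ".encode("utf-8")       # b'\xce\xbb'
print(is_continuation(data[1]))  # True - the second byte is a follow-up byte
try:
    data[:1].decode("utf-8")     # decode only the first byte - an incomplete character
except UnicodeDecodeError as error:
    print("truncated UTF-8 detected:", error)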

So characters in UTF-8 can be represented in binary as:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
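
As a worked example of my own using the two-byte pattern above: λ is code point U+03BB, whose 11-bit value is 01110 111011. Dropping the high 5 bits into 110xxxxx and the low 6 bits into 10xxxxxx gives 11001110 10111011, that is the bytes 0xCE 0xBB, which matches what Python 3 produces:

code_point = ord("λ")                            # 955, i.e. U+03BB
first = 0b11000000 | (code_point >> 6)           # 110 + high 5 bits -> 0xCE
second = 0b10000000 | (code_point & 0b00111111)  # 10  + low 6 bits  -> 0xBB
print(hex(first), hex(second))                   # 0xce 0xbb
print("λ".encode("utf-8"))                       # b'\xce\xbb'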

If you’re interested in further reading regarding Unicode and character encodings, a good place to start would be Wikipedia:
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Character_encoding


  1. #1 by Anonymous at September 16th, 2008

    shouldn’t a byte of the form 1110xxxx represent a two byte character and not 3 byte character as the article says?

  2. #2 by Anonymous2 at October 16th, 2008

    This was a great and very helpful article. I’m working on an eLearning Application Security module and need to explain the dangers of not declaring the encoding scheme.

    See here: http://ha.ckers.org/blog/20060817/variable-width-encoding/
    for more information about the vulnerability.

    Your explanation clarified a few things that I couldn’t figure out from the Wikipedia explanation.

    To the poster before me..
    I can’t say for sure as most of my information is coming from this article, but I think the first 1 denotes that the character will be multibyte, and all trailing 1’s before a 0 says how many bytes will follow for that character, not the number of bytes total.

    Hope that is accurate and clarifies the subject.

    Thanks again!

  3. #3 by Anonymous2 at October 16th, 2008

    oops, got it backwards - I need to learn to read.

    The number of 1’s denotes the TOTAL number of bytes in the character.

  4. #4 by Luciano Ramalho at November 10th, 2008

    No, the post is right. See:
    http://www.ietf.org/rfc/rfc3629.txt
