UTF-8

UTF-8 uses a variable number of bytes to store a Unicode character. The length of the encoding depends on the code range: from 1 to 3 bytes for UCS-2 and up to 6 bytes for UCS-4. It is called "UTF-8" because the length varies in units of 8 bits (1 byte). UTF-8 is well suited to data transfer over the Internet and other networks, and to applications that must use a slow connection (e.g. a modem). For English text, UTF-8 produces a much shorter data stream than the other Unicode transformation formats.
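The size advantage for English text can be checked with a few lines of Python. This is only an illustrative sketch: the sample string is an assumption chosen for the example, and the big-endian codec names are used so that no byte-order mark is counted.

```python
# Compare the encoded size of the same ASCII-only text in three
# Unicode transformation formats. The sample text is hypothetical.
text = "Hello, world"

utf8_len = len(text.encode("utf-8"))       # 1 byte per ASCII character
utf16_len = len(text.encode("utf-16-be"))  # 2 bytes per character, no BOM
utf32_len = len(text.encode("utf-32-be"))  # 4 bytes per character, no BOM

print(utf8_len, utf16_len, utf32_len)  # prints: 12 24 48
```

For text outside the ASCII range the ratio changes, since each such character needs two or more bytes in UTF-8.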

In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
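The rule for the initial octet means that a reader can tell the length of a sequence from its first octet alone, by counting the leading 1 bits. The following is a minimal sketch of that idea for the 1-to-6 octet scheme described here; the function name `sequence_length` is made up for this example.

```python
def sequence_length(first_octet: int) -> int:
    """Return the sequence length n announced by an initial octet."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")

    # Count the leading 1 bits, starting at the high-order bit.
    ones = 0
    mask = 0x80
    while mask and first_octet & mask:
        ones += 1
        mask >>= 1

    if ones == 0:
        return 1  # 0xxxxxxx: a sequence of one octet
    if ones == 1:
        # 10xxxxxx is a following octet, never an initial one.
        raise ValueError("following octet cannot start a sequence")
    if ones > 6:
        # 0xFE and 0xFF do not occur in this scheme.
        raise ValueError("invalid initial octet")
    return ones


print(sequence_length(0x41))  # 'A'
print(sequence_length(0xC3))  # first octet of a two-octet sequence
```

Because following octets always start with the bits 10, a decoder that lands in the middle of a stream can skip them until it finds the next initial octet.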

The table below summarizes the format of these different octet types. The letter 'x' indicates bits available for encoding bits of the UCS-4 character value.

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
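
The table can be turned into an encoder almost row by row. The sketch below is one possible way to do it, not a reference implementation: the name `encode_ucs4` and the layout of the `ROWS` list are assumptions made for this example. It follows the table only, so it accepts every value up to 7FFF FFFF and does not reject surrogate values. Note that the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so only the first four rows are used in practice today.

```python
# One entry per table row:
# (upper limit of the UCS-4 range, number of octets, marker bits of the initial octet)
ROWS = [
    (0x0000007F, 1, 0x00),  # 0xxxxxxx
    (0x000007FF, 2, 0xC0),  # 110xxxxx
    (0x0000FFFF, 3, 0xE0),  # 1110xxxx
    (0x001FFFFF, 4, 0xF0),  # 11110xxx
    (0x03FFFFFF, 5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value as a sequence of 1 to 6 octets."""
    if value < 0:
        raise ValueError("value must not be negative")

    # Pick the first row whose range contains the value.
    for limit, n, marker in ROWS:
        if value <= limit:
            break
    else:
        raise ValueError("value exceeds 7FFF FFFF")

    octets = []
    # Fill the following octets (10xxxxxx) from the low-order end,
    # six bits at a time.
    for _ in range(n - 1):
        octets.append(0x80 | (value & 0x3F))
        value >>= 6
    # The bits that remain go into the 'x' positions of the initial octet.
    octets.append(marker | value)
    return bytes(reversed(octets))


print(encode_ucs4(0x20AC).hex())  # prints: e282ac
```

For values that the current standard also allows, the result matches Python's built-in codec, for example `encode_ucs4(0x20AC) == chr(0x20AC).encode("utf-8")`.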