ISO

UTF-8

UTF-8 uses a variable number of bytes to store a Unicode character. The length of the encoding depends on the code range: 1 to 3 bytes for UCS-2 and up to 6 bytes for UCS-4. Because its code unit is 8 bits (1 byte), it is called "UTF-8". UTF-8 is well suited to data transfer over the Internet, over networks, or in applications that must use a slow connection (e.g. through a modem). For English text, UTF-8 produces a much shorter data stream than the other Unicode transformation formats.
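The size claim above can be checked with a short Python sketch. The sample string and the variable names are illustrative assumptions, not part of the original text; the codecs used are the ones built into Python.

```python
# Compare the encoded size of a short English (ASCII-only) string
# in three Unicode transformation formats.
text = "Hello, world"  # 12 characters, all in the range 0000-007F

sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

# The length of a UTF-8 sequence depends on the code range of the character.
lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "\u00e9", "\u20ac")}

print(sizes)
print(lengths)
```

For this sample the UTF-8 stream is half the size of the UTF-16 stream and a quarter of the UTF-32 stream, because every character falls in the one-byte range.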

In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
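As a minimal sketch of the rule just described, the function below reads the length of a sequence from its initial octet by counting the higher-order bits set to 1. The name sequence_length is hypothetical, and the handling of invalid octets (raising ValueError) is an assumption made for this example.

```python
def sequence_length(first_octet: int) -> int:
    """Return the number of octets in the sequence that starts with first_octet."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")

    # Higher-order bit 0: a sequence of one octet carrying 7 value bits.
    if first_octet & 0x80 == 0:
        return 1

    # Count the leading 1 bits; the count n is the sequence length.
    n = 0
    mask = 0x80
    while first_octet & mask:
        n += 1
        mask >>= 1

    # A single leading 1 (10xxxxxx) marks a following octet, not an initial
    # one, and more than six leading 1 bits is not a valid initial octet.
    if n == 1 or n > 6:
        raise ValueError("not a valid initial octet")
    return n


print(sequence_length(0x41), sequence_length(0xC3), sequence_length(0xE2))
```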

The table below summarizes the format of these different octet types. The letter 'x' indicates bits available for encoding bits of the UCS-4 character value.

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
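
The table above can be turned into an encoder more or less directly. The following is a minimal sketch, not a reference implementation: the names encode_ucs4 and _RANGES are hypothetical, and the encoder is written by hand because Python's built-in "utf-8" codec follows the later, restricted definition of UTF-8 and does not accept values above 0010 FFFF, so it cannot produce the 5- and 6-octet forms.

```python
# One entry per table row:
# (upper bound of the range, number of octets, marker bits of the initial octet)
_RANGES = [
    (0x0000007F, 1, 0x00),  # 0xxxxxxx
    (0x000007FF, 2, 0xC0),  # 110xxxxx
    (0x0000FFFF, 3, 0xE0),  # 1110xxxx
    (0x001FFFFF, 4, 0xF0),  # 11110xxx
    (0x03FFFFFF, 5, 0xF8),  # 111110xx
    (0x7FFFFFFF, 6, 0xFC),  # 1111110x
]


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value as a sequence of 1 to 6 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")

    # Pick the first row whose range contains the value.
    for upper, length, marker in _RANGES:
        if value <= upper:
            break

    # Fill the following octets from the low-order end, 6 bits at a time,
    # each prefixed with the bits 10.
    octets = []
    for _ in range(length - 1):
        octets.append(0x80 | (value & 0x3F))
        value >>= 6

    # The bits that remain go into the initial octet, after its marker.
    octets.append(marker | value)
    return bytes(reversed(octets))


for v in (0x41, 0xE9, 0x20AC, 0x1F600, 0x7FFFFFFF):
    print(f"{v:08X} -> {encode_ucs4(v).hex(' ').upper()}")
```

For values up to 0010 FFFF (excluding surrogates) the result should match Python's own str.encode("utf-8"), which gives a convenient cross-check for the first four rows of the table.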