UTF-8
UTF-8 uses a variable number of bytes to store a Unicode character.
The code length depends on the code range: it varies from 1 to 3 bytes
for UCS-2 and up to 6 bytes for UCS-4. Because the encoding is built
from 8-bit (1-byte) units, it is called "UTF-8". UTF-8 is suitable for
data transfer over the Internet, other networks, or applications that
must use a slow connection (e.g. through a modem). For English text
transfer, UTF-8 provides a much shorter data stream than the other
Unicode transformation formats.
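As a rough illustration of the size claim for English text, the sketch below encodes one short ASCII-only string three ways with Python's built-in codecs. The sample string is an arbitrary choice, and the big-endian codec variants are used only so that no byte-order mark is counted.

```python
# Compare the encoded size of a short English (ASCII-only) string.
text = "Hello, world"          # 12 characters, chosen only for illustration

utf8_len = len(text.encode("utf-8"))       # 1 byte per ASCII character
utf16_len = len(text.encode("utf-16-be"))  # 2 bytes per character here
utf32_len = len(text.encode("utf-32-be"))  # 4 bytes per character

print(utf8_len, utf16_len, utf32_len)      # prints: 12 24 48
```

The advantage holds only while the text stays mostly within the one-octet range; text in other scripts needs two or more octets per character in UTF-8.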
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set
to 0, the remaining 7 bits being used to encode the character value.
In a sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s)
of that octet contain bits from the value of the character to be
encoded. The following octet(s) all have the higher-order bit set
to 1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
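The rules above can be sketched as a small hand-written encoder. This is a minimal illustration of the 1-to-6-octet scheme as described here, not a replacement for a library codec; the function name `utf8_encode` is hypothetical. Python's built-in codec is not used because it stops at U+10FFFF and so never produces the 5- and 6-octet forms.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS-4 value (0..0x7FFFFFFF) with the 1-to-6-octet scheme."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")

    # A sequence of one: high-order bit 0, seven bits of payload.
    if code_point < 0x80:
        return bytes([code_point])

    # Find the smallest n (2..6) that can hold the value. A sequence of
    # n octets carries (7 - n) bits in the initial octet plus 6 bits in
    # each of the (n - 1) following octets.
    n = 2
    while code_point >= 1 << ((7 - n) + 6 * (n - 1)):
        n += 1

    # Build the following octets from the low-order end: 10xxxxxx each.
    octets = []
    value = code_point
    for _ in range(n - 1):
        octets.append(0x80 | (value & 0x3F))
        value >>= 6

    # Initial octet: n high-order bits set to 1, then a 0 bit, then
    # whatever bits of the value remain.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    octets.append(lead_marker | value)

    return bytes(reversed(octets))


print(utf8_encode(0x20AC).hex())   # prints: e282ac
```

For values up to U+10FFFF (surrogates aside) the result matches Python's own `str.encode("utf-8")`, which gives a quick cross-check.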
The table below summarizes the format of these different octet types.
The letter 'x' indicates bits available for encoding bits of the
UCS-4 character value.
UCS-4 range (hex.)   | UTF-8 octet sequence (binary)
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF  | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF  | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
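Read in the other direction, the table also tells a decoder how long a sequence is from its initial octet alone. The sketch below is one possible way to do that, assuming the 1-to-6-octet layout shown in the table; the names `sequence_length` and `utf8_decode` are hypothetical, and the decoder deliberately skips checks a production decoder would need, such as rejecting overlong forms.

```python
def sequence_length(lead: int) -> int:
    """Return the number of octets in a sequence, given its initial octet."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx
    if lead < 0xC0:
        raise ValueError("10xxxxxx is a following octet, not an initial one")
    # Count the run of 1 bits at the high-order end of the octet.
    n = 0
    mask = 0x80
    while lead & mask:
        n += 1
        mask >>= 1
    if n > 6:
        raise ValueError("0xFE and 0xFF never appear in this scheme")
    return n


def utf8_decode(octets: bytes) -> int:
    """Decode exactly one sequence back to its UCS-4 value."""
    n = sequence_length(octets[0])
    if len(octets) != n:
        raise ValueError("wrong number of octets for this initial octet")
    if n == 1:
        return octets[0]
    # Keep only the 'x' bits of the initial octet: (7 - n) of them.
    value = octets[0] & (0x7F >> n)
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("following octet must look like 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value


print(hex(utf8_decode(b"\xe2\x82\xac")))   # prints: 0x20ac
```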