Surrogates and Handling of Surrogate Pairs

To extend the coding space beyond the BMP, ISO/IEC 10646:2000 defines a special zone in the BMP, called the Surrogate Zone, with the range D800 to DFFF. It is divided into two parts: the High-half zone (D800-DBFF) and the Low-half zone (DC00-DFFF). A pair consisting of one two-byte code from the High-half zone followed by one from the Low-half zone represents a character beyond the BMP. The first two-byte code of the pair is called the high-surrogate (D800-DBFF), and the second is called the low-surrogate (DC00-DFFF).

Surrogate pairs provide a mechanism for using the two-byte BMP code range to represent another 16 planes of characters. Because infrequently used characters will be assigned to surrogate pairs, not all implementations need to handle these pairs initially. Surrogate pairs are now implemented by vendors in support of ISO/IEC 10646-2:2001 and Unicode 3.1.

High-surrogates and low-surrogates are assigned to disjoint ranges of code positions. Non-surrogate characters can never be assigned to these ranges. Because the high- and low-surrogate ranges are disjoint, determining character boundaries requires scanning at most one preceding or following Unicode code value, without regard to any other context. In well-formed text, a low-surrogate can be preceded only by a high-surrogate, and a high-surrogate can be followed only by a low-surrogate.
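
The boundary rule above can be illustrated with a minimal sketch. It assumes the text is held as a sequence of 16-bit code values; the function names (is_high_surrogate, is_low_surrogate, char_start) are hypothetical and chosen only for this illustration.

```python
def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code value lies in the High-half zone D800-DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code value lies in the Low-half zone DC00-DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def char_start(units, i):
    """Return the index at which the character containing units[i] begins.

    Only units[i] and, at most, the one preceding code value are examined.
    """
    if is_low_surrogate(units[i]) and i > 0 and is_high_surrogate(units[i - 1]):
        return i - 1  # units[i] is the second half of a pair
    return i          # BMP character, high-surrogate, or unpaired low-surrogate


# 'A', the pair <D800,DC00>, 'B'
units = [0x0041, 0xD800, 0xDC00, 0x0042]
starts = [char_start(units, i) for i in range(len(units))]
```

Here both halves of the pair (indices 1 and 2) resolve to index 1, and no other context is consulted.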

As long as an implementation does not remove either surrogate or insert another character between them, data integrity is maintained. Moreover, even if the data become corrupted, the corruption is localized: corrupting a single Unicode value affects only a single character. Because the high- and low-surrogate ranges are disjoint and surrogates always occur in pairs, errors are prevented from propagating through the rest of the text.
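
As a rough illustration of this localization, the following sketch scans a sequence of 16-bit code values and reports the positions of unpaired surrogates. The function name find_unpaired_surrogates is hypothetical, and the sketch shows only one possible way to check well-formedness.

```python
def find_unpaired_surrogates(units):
    """Return the indices of surrogates that are not part of a well-formed pair."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high-surrogate must be followed immediately by a low-surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both halves
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low-surrogate reached here was not preceded by a high-surrogate.
            bad.append(i)
        i += 1
    return bad


# The low half of the first pair has been overwritten by 0042 ('B');
# the later pair <D801,DC01> is still recognized correctly.
damaged = [0x0041, 0xD800, 0x0042, 0xD801, 0xDC01]
errors = find_unpaired_surrogates(damaged)
```

Only the damaged position is reported; the scan resynchronizes at the next code value, so the pair that follows is unaffected.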

More specifically, a surrogate pair <H,L> consists of two UCS-2 code values, H followed by L, where H is in the range D800 to DBFF and L is in the range DC00 to DFFF. The character represented by a surrogate pair can be mapped to a UCS-4 code point. If N is the UCS-4 scalar value, then

N = (H - D800) * 400 + (L - DC00) + 10000

where H is the high-surrogate and L is the low-surrogate in the surrogate pair <H,L>, and all constants are hexadecimal. Thus, N is in the range 10000 to 10FFFF.
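
The mapping can be sketched in code as follows. The function names decode_surrogate_pair and encode_surrogate_pair are hypothetical; the second function is simply the inverse of the formula and is included here only as an illustration.

```python
def decode_surrogate_pair(h: int, l: int) -> int:
    """Map a surrogate pair <H,L> to its UCS-4 scalar value N."""
    if not 0xD800 <= h <= 0xDBFF:
        raise ValueError("H is not a high-surrogate")
    if not 0xDC00 <= l <= 0xDFFF:
        raise ValueError("L is not a low-surrogate")
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


def encode_surrogate_pair(n: int):
    """Inverse mapping: split a scalar value in 10000-10FFFF into <H,L>."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("N is outside the range 10000-10FFFF")
    offset = n - 0x10000                 # 20-bit value
    h = 0xD800 + (offset >> 10)          # upper 10 bits select the high-surrogate
    l = 0xDC00 + (offset & 0x3FF)        # lower 10 bits select the low-surrogate
    return h, l


lowest = decode_surrogate_pair(0xD800, 0xDC00)    # smallest pair
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)   # largest pair
pair = encode_surrogate_pair(0x20000)             # first code point of plane 2
```

The smallest pair <D800,DC00> gives 10000 and the largest pair <DBFF,DFFF> gives 10FFFF, matching the range stated above.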