Questions about Technical Area

1. What is UCS?
UCS stands for Universal Multiple-Octet Coded Character Set, the character set defined by ISO/IEC 10646. Each cell in its coding space either holds a coded character or is declared unused.

UCS has two coding forms: UCS-4 and UCS-2. UCS-4 is the canonical form of UCS, in which each character is represented by a 4-octet code. The four octets locate the character within the coded character set: a Group-octet (128 values, 00-7F), a Plane-octet (256 values, 00-FF), a Row-octet (256 values, 00-FF), and a Cell-octet (256 values, 00-FF).
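As an illustration of the group/plane/row/cell addressing, the following minimal Python sketch splits a UCS-4 code value into its four octets. The function name `ucs4_octets` is hypothetical and chosen only for this example.

```python
def ucs4_octets(code_value):
    """Split a UCS-4 code value into (group, plane, row, cell) octets."""
    # Groups run from 00 to 7F, so a UCS-4 code value fits in 31 bits.
    if not 0 <= code_value <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code values range from 0 to 0x7FFFFFFF")
    group = (code_value >> 24) & 0x7F
    plane = (code_value >> 16) & 0xFF
    row = (code_value >> 8) & 0xFF
    cell = code_value & 0xFF
    return group, plane, row, cell


# U+4E2D lies in the BMP: group 00, plane 00, row 4E, cell 2D.
print(ucs4_octets(0x4E2D))   # (0, 0, 78, 45)
# U+20000 lies in plane 02 of group 00.
print(ucs4_octets(0x20000))  # (0, 2, 0, 0)
```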

ISO/IEC 10646 specifies the first plane (Plane 00 of Group 00) to be the Basic Multilingual Plane (BMP).

UCS-2 is the 2-octet BMP form. It represents every character in the BMP in 2 octets. An additional 16 planes in Group 0 of UCS-4 can be reached through the surrogate pair method; strictly speaking, this extension is what UTF-16 defines, since plain UCS-2 is limited to the BMP.

2. What is the difference between UCS-2 and UCS-4?
UCS-2 is the 2-octet form of UCS. It provides 65,536 code positions, all of which belong to plane 0 of group 0 (the BMP). Combined with surrogate pairs, the 2-octet form can also represent planes 1 to 16 of UCS-4.

UCS-4 uses the 4-octet form and can represent any character in the ISO/IEC 10646 standard. The 2-octet form, even with surrogate pairs, reaches only the first 17 planes of UCS-4.

3. What is UTF?
UTF stands for Unicode (or UCS) Transformation Format. It defines a set of transformations of UCS into different representations for data transfer, designed with compatibility with other encodings in mind. The most common transformation formats are UTF-8 and UTF-16. UTF-7 was also sometimes used for 7-bit data transfer.

4. What is the difference between UTF-8 and UTF-16?
UTF-8 uses a variable number of bytes to store a Unicode character. The length of the code depends on the code range: from 1 to 4 bytes under the current definition (the original ISO/IEC 10646 definition allowed up to 6 bytes). It is called "UTF-8" because its code unit is 8 bits (1 byte). UTF-8 is well suited to the Internet, to networks, and to applications that run over slow connections, since ASCII text takes only one byte per character.
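The variable length of UTF-8 can be observed with Python's built-in `utf-8` codec. This is only a small sketch; the helper name `utf8_hex` is hypothetical, and the sample characters were picked to fall in four different code ranges.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as spaced hex digits."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


# One sample character from each of four code ranges.
for ch in ["A", "\u00e9", "\u4e2d", "\U00020000"]:
    print(f"U+{ord(ch):04X}: {utf8_hex(ch)}")
# U+0041: 41
# U+00E9: c3 a9
# U+4E2D: e4 b8 ad
# U+20000: f0 a0 80 80
```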

UTF-16 is the Unicode (or UCS) Transformation Format in its 16-bit encoding form. It serializes a Unicode scalar value (code point) as one 16-bit code unit (two bytes) for a BMP character, or as two code units (a surrogate pair, four bytes) for a character outside the BMP, in either big-endian or little-endian byte order. It is called "UTF-16" because its code unit is 16 bits (2 bytes). It is widely used, for example in the string types of Java and the Windows API.
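The effect of byte order on UTF-16 can be sketched with Python's built-in `utf-16-be` and `utf-16-le` codecs. The helper `utf16_hex` and its `byte_order` parameter are hypothetical names used only for this illustration.

```python
def utf16_hex(ch, byte_order):
    """Return the UTF-16 bytes of one character as spaced hex digits.

    byte_order is "big" or "little" (a naming choice made for this sketch).
    """
    codec = {"big": "utf-16-be", "little": "utf-16-le"}[byte_order]
    return " ".join(f"{b:02x}" for b in ch.encode(codec))


# A BMP character takes one 16-bit code unit; only the byte order differs.
print(utf16_hex("\u4e2d", "big"))      # 4e 2d
print(utf16_hex("\u4e2d", "little"))   # 2d 4e
# A character outside the BMP takes two code units (four bytes).
print(utf16_hex("\U00020000", "big"))  # d8 40 dc 00
```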

5. What is a Surrogate?
Surrogate pairs provide a mechanism for using the two-byte BMP code range to represent the other 16 planes of group 0 without requiring 32-bit characters. Because the characters assigned to surrogate pairs are predominantly infrequently used ones, not all implementations needed to handle these pairs initially. Vendors are now widely expected to implement surrogate pairs in support of ISO/IEC 10646-2:2001 and Unicode 3.1.
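The arithmetic behind a surrogate pair can be sketched as follows. The function names are hypothetical; the constants are the standard ones, with high surrogates starting at D800 and low surrogates at DC00, each carrying 10 bits of the code point's offset from 10000 (hexadecimal).

```python
def to_surrogate_pair(code_point):
    """Map a code point in planes 1 to 16 to a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recover the code point from a (high, low) surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x20000)
print(f"{high:04X} {low:04X}")                 # D840 DC00
print(f"{from_surrogate_pair(high, low):X}")   # 20000
```

The first and last code points of the range map to D800 DC00 and DBFF DFFF, which is why exactly 16 planes of 65,536 positions each fit in this scheme.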

6. What is a Glyph?
A glyph is the actual shape of a character. The glyph of a Chinese character gives its geometric structure: the strokes, the components, and their relative positions.

7. What programming languages support Unicode?
Unicode is supported in many programming languages and environments, including Java, C on Linux, and Microsoft Visual C++. HTML files can also be encoded in UTF-8.