about Technical Area
1. What is UCS?
UCS stands for Universal Multiple-Octet Coded Character Set. It is
the name of ISO/IEC 10646. A character is located and coded at a cell
within this coding space or the cell is declared unused.
UCS has two coding forms, UCS-4 and UCS-2. UCS-4 is the canonical
form of UCS: each character is represented by a 4-octet code, and
its position in the coded character set is given by a Group-octet
(00-7F, 128 groups), a Plane-octet (00-FF, 256 planes), a Row-octet
(00-FF, 256 rows), and a Cell-octet (00-FF, 256 cells).
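The four-octet addressing described above can be sketched in a few
lines of Python. The function name ucs4_octets is made up for this
illustration; it is not part of any standard library.

```python
def ucs4_octets(code_point):
    """Split a UCS-4 code value into (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values occupy 31 bits (groups 00-7F)")
    group = (code_point >> 24) & 0x7F   # 128 possible groups
    plane = (code_point >> 16) & 0xFF   # 256 planes per group
    row = (code_point >> 8) & 0xFF      # 256 rows per plane
    cell = code_point & 0xFF            # 256 cells per row
    return group, plane, row, cell


# U+4E2D lies in group 00, plane 00 (the BMP), row 4E, cell 2D.
print([f"{octet:02X}" for octet in ucs4_octets(0x4E2D)])
```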
ISO/IEC 10646 specifies the first plane (Plane 00 of Group 00) to
be the Basic Multilingual Plane (BMP).
UCS-2 is the 2-octet BMP form. It can represent every character in
the BMP in 2 octets. Combined with the surrogate pair method (strictly,
this extended form is UTF-16; plain UCS-2 covers only the BMP), it
can also represent the additional 16 planes in Group 00 of UCS-4.
What is the difference between UCS-2 and UCS-4?
In the 2-octet form of UCS, UCS-2 provides 65,536 positions for
coding characters, all of which belong to plane 0 of group 0. Using
surrogate pairs, the 2-octet form can also represent planes 1 to 16
of UCS-4.
UCS-4, by contrast, uses the 4-octet form to represent any character
in the ISO/IEC 10646 standard. The 2-octet form, even with surrogate
pairs, can reach only the first 17 planes of UCS-4.
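As a rough illustration of the difference, the sketch below checks
which characters fit in a single 2-octet unit and prints the 4-octet
form of each. The helper names plane_of and fits_in_two_octets are
hypothetical. It assumes Python's "utf-32-be" codec, which for code
points up to U+10FFFF produces the same octets as the UCS-4 form.

```python
def plane_of(code_point):
    """Return the plane number within group 00 (0 is the BMP)."""
    return code_point >> 16


def fits_in_two_octets(code_point):
    """True when the code point lies in the BMP, so one 2-octet unit is enough."""
    return plane_of(code_point) == 0


for cp in (0x0041, 0x4E2D, 0x20000):
    four_octet = chr(cp).encode("utf-32-be")  # same octets as the UCS-4 form
    print(f"U+{cp:04X} plane={plane_of(cp)} "
          f"two_octet={fits_in_two_octets(cp)} ucs4={four_octet.hex()}")
```

U+20000 is in plane 2, so it needs either the 4-octet form or a
surrogate pair in the 2-octet form.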
UTF stands for Unicode (or UCS) Transformation Format. It defines
a set of transformations of UCS into different representations, both
for data transfer and for compatibility with other encodings. The
most common transformation formats are UTF-8 and UTF-16. UTF-7 was
also sometimes used for 7-bit data.
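To make the idea of "different representations" concrete, the
following sketch encodes the same short string with three of the
formats named above, using the codecs built into Python. The sample
string is arbitrary.

```python
text = "UCS \u4e2d"  # four ASCII characters plus one CJK character

for name in ("utf-7", "utf-8", "utf-16-be"):
    data = text.encode(name)
    print(f"{name:9} {len(data):2} bytes  {data.hex()}")

# Every byte of the UTF-7 form stays below 0x80, so it survives 7-bit handling.
utf7 = text.encode("utf-7")
print(all(byte < 0x80 for byte in utf7))
```

The character data is identical in all three cases; only the byte
sequence used to carry it changes.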
What is the difference between UTF-8 and UTF-16?
UTF-8 uses a variable number of bytes to store a Unicode character.
The length depends on the code range: as originally defined for the
full UCS-4 space it ranges from 1 byte to 6 bytes, while the current
definition (RFC 3629), restricted to U+0000 through U+10FFFF, uses
at most 4 bytes. Because its code unit is 8 bits (1 byte), it is
called "UTF-8". UTF-8 is suitable for use on the Internet, on networks,
and in applications that run over slow connections.
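The range-dependent length can be seen in a hand-written encoder. This
is a minimal sketch, not a replacement for a library codec: the name
utf8_encode is invented here, it covers only the 1- to 4-byte patterns
(enough for every code point up to U+10FFFF), and it does not reject
surrogate code points.

```python
def utf8_encode(code_point):
    """Encode one code point by hand using the 1- to 4-byte UTF-8 patterns."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx and three 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond U+10FFFF; longer forms are not sketched here")


for cp in (0x41, 0xE9, 0x4E2D, 0x20000):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")   # cross-check with the built-in codec
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```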
UTF-16 is the Unicode (or UCS) Transformation Format, 16-bit encoding
form. It serializes a Unicode scalar value (code point) as one 16-bit
code unit (2 bytes) for characters in the BMP, or as a pair of 16-bit
code units (4 bytes) for characters in planes 1 to 16, in either
big-endian or little-endian format. Because it is grouped in 16-bit
(2-byte) units, it is called "UTF-16".
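The two byte orders can be compared directly with Python's built-in
codecs. This is only an illustration; the sample character is
arbitrary, and the byte order mark shown is one common way for a
reader to tell the two orders apart.

```python
import codecs

text = "\u4e2d"  # one BMP character, so one 16-bit code unit

big = text.encode("utf-16-be")      # most significant byte first
little = text.encode("utf-16-le")   # least significant byte first
print(big.hex(), little.hex())

# With a byte order mark (U+FEFF) in front, the generic "utf-16"
# decoder works out the byte order by itself.
with_bom = codecs.BOM_UTF16_LE + little
print(with_bom.decode("utf-16") == text)
```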
Surrogate pairs provide a mechanism for using the two-byte BMP code
range to represent another 16 planes of characters of group 0 without
requiring the use of 32-bit characters. Because mostly infrequently
used characters are assigned to surrogate pairs, not all implementations
needed to handle these pairs initially. Vendors are now widely expected
to implement surrogate pairs in support of ISO/IEC 10646-2:2001 and
Unicode 3.1.
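The surrogate pair arithmetic is short enough to sketch. The function
names below are hypothetical. The constants are the ones UTF-16 uses:
high surrogates D800-DBFF and low surrogates DC00-DFFF, each carrying
10 bits of the code point after 0x10000 is subtracted.

```python
def to_surrogates(code_point):
    """Split a code point in planes 1 to 16 into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only planes 1 to 16 are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")
print(f"{from_surrogates(high, low):X}")
```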
Glyph refers to the actual shape of a character. A Chinese character
glyph gives the geometric structure, such as strokes, components
and relative positions of the strokes and components.
Which programming languages support Unicode?
Unicode is supported in many programming languages and environments,
including Java, C on Linux, and Microsoft Visual C++. HTML files
also support UTF-8.