Unicode Programming in C/Linux

Unlike Microsoft Windows, whose native API uses UTF-16, Linux generally uses UTF-8. Before you start using UTF-8 under Linux, make sure the distribution ships glibc 2.2 and XFree86 4.0 or newer. Earlier versions lack UTF-8 locale support and ISO10646-1 X11 fonts.

There are two approaches for adding UTF-8 support to a Linux application. In the first (passive) approach, data is kept in UTF-8 form everywhere, which requires only a few software changes. In the second (converted) approach, UTF-8 data that has been read is converted into wide-character arrays using standard C library functions, and strings are converted back to UTF-8 on output, for example with the function wcsrtombs().



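Programs that count characters by counting bytes need small changes. In UTF-8, continuation bytes (those of the form 10xxxxxx) must not be counted as characters. If a UTF-8 locale has been selected, a strlen(s) call that is meant to count characters should be replaced with mbstowcs(NULL, s, 0), which returns the number of characters without storing anything. strlen(s) remains correct wherever a byte count is wanted, such as when allocating memory.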



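A common use of strlen is to estimate display width. This no longer works with UTF-8, and not only because of multibyte sequences: Chinese and other ideographic characters occupy two column positions. The wcwidth() function reports the display width of a single wide character, and wcswidth() that of a whole wide-character string.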



Officially, starting with glibc 2.2, the type wchar_t is intended to hold only 32-bit ISO 10646 values, independent of the currently used locale. This is signaled to applications by the definition of the __STDC_ISO_10646__ macro, as specified by ISO C99: if it is defined, wchar_t values are Unicode code points. Its value is an integer constant of the form yyyymmL, giving the year and month of the ISO 10646 revision that is supported.
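
One way to use the macro is a compile-time guard, sketched below. The function iso10646_revision() is a hypothetical addition that merely exposes the yyyymm value at run time; the error message text is likewise only an illustration.

```c
#include <wchar.h>

/* Refuse to build on platforms where wchar_t is not guaranteed to hold
 * ISO 10646 (Unicode) code points. */
#ifndef __STDC_ISO_10646__
#error "wchar_t is not ISO 10646 on this platform"
#endif

/* Hypothetical helper: the supported ISO 10646 revision as yyyymm. */
long iso10646_revision(void)
{
    return __STDC_ISO_10646__;
}
```

Code that passes this guard can compare wchar_t values against Unicode code points directly, for example testing wc == 0x20AC for the euro sign.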



The proper way to activate UTF-8 is the POSIX locale mechanism. A locale is a configuration setting that describes culture-specific conventions of software behavior, including character encoding, date/time notation, sorting rules and measurement systems. Locale names usually consist of an ISO 639-1 language code, an ISO 3166-1 country code, and optional encoding names and other qualifiers, as in en_US.UTF-8. You can list all locales installed on your system (usually in /usr/lib/locale/) with the command locale -a.
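
Inside a C program, the locale is selected with setlocale(), and the resulting encoding can be queried with nl_langinfo(CODESET). The sketch below wraps both calls in a hypothetical helper, select_locale(); passing the empty string "" picks up the user's settings from the environment (LC_ALL, LC_CTYPE, LANG).

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: select the named locale ("" means "use the
 * environment"). Returns -1 if the locale cannot be set, 1 if its
 * character encoding is UTF-8, and 0 otherwise. */
int select_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL)
        return -1;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call select_locale("") once at startup, before any multibyte or wide-character function is used, and fall back to byte-oriented behavior if the result is not 1.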

If you want to convert text from another encoding standard (e.g. Big5) to Unicode, you can write a Perl script: generate an array that uses the original code as the index and the target code as the value, then use a regular expression to substitute each Big5 code with its Unicode equivalent.
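
The same table idea can also be expressed in C, which is shown here instead of Perl to match the rest of this article. This is only a sketch under stated assumptions: the table holds three sample entries, where a real one would be generated from a complete Big5 mapping file, and the names big5_table and big5_decode() are hypothetical. Unmapped codes are replaced with U+FFFD.

```c
#include <stddef.h>

/* Original code as the index, target code as the value. Only three
 * sample entries are filled in; all other slots are 0 (unmapped). */
static const unsigned short big5_table[0x10000] = {
    [0xA440] = 0x4E00,
    [0xA441] = 0x4E59,
    [0xA442] = 0x4E01,
};

/* Hypothetical helper: decode len bytes of Big5 text into Unicode code
 * points. out must have room for len entries. Returns the number of
 * code points written. */
size_t big5_decode(const unsigned char *in, size_t len, unsigned int *out)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                    /* ASCII passes through */
        } else if (i + 1 < len) {
            unsigned int code = ((unsigned int)in[i] << 8) | in[i + 1];
            unsigned int cp = big5_table[code];

            out[n++] = cp ? cp : 0xFFFD;         /* unmapped: U+FFFD */
            i++;                                 /* consume trail byte */
        } else {
            out[n++] = 0xFFFD;                   /* truncated pair */
        }
    }
    return n;
}
```

A directly indexed array trades memory (one slot per possible 16-bit code) for constant-time lookup. The decoded code points can then be written out as UTF-8 or stored as wchar_t values.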