Chapter 4. Coded Character Sets And Encodings in the World

4.4.3 Problems on Unicode

No standard is free from politics and compromise. Though the concept of a single unified CCS
for all the characters in the world is very appealing, Unicode had to consider compatibility with
preceding international and local standards. Moreover, contrary to that ideal, the Unicode people
weighed efficiency too heavily. IMHO, the surrogate pair is a mess caused by the shortage of
16-bit code space. I will introduce a few problems with Unicode.
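
To make the surrogate pair remark concrete, here is a minimal sketch in Python of how UTF-16 splits a character that does not fit in 16 bits into two 16-bit units. The function name to_surrogate_pair is my own choice for this illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # upper 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits go into the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20000)
    print("U+20000 -> %04X %04X" % (high, low))
    # Cross-check against Python's own UTF-16 encoder (big endian, no BOM).
    print(chr(0x20000).encode("utf-16-be").hex())
```

Software that assumes one character is always one 16-bit unit has to be revised to handle such pairs, which is the kind of complication I mean.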

Han Unification

This is the point on which Unicode is criticized most strongly by many Japanese people.

A region of 0x4e00 - 0x9fff in UCS-2 is used for East Asian ideographs (Japanese Kanji, Chinese
Hanzi, and Korean Hanja). There are similar characters in these four character sets. (There are
two sets of Chinese characters: simplified Chinese, used in P. R. China, and traditional Chinese,
used in Taiwan.) To reduce the number of ideographs to be encoded (the region for these
characters can contain only 20992 characters, while Taiwan's CNS 11643 standard alone contains
48711 characters), these similar characters are assumed to be the same. This is Han Unification.
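
The figure of 20992 can be checked from the range itself. The following is a small sketch in Python; the constant and function names are my own, and the CNS 11643 count is simply the number quoted above, not something computed here.

```python
HAN_FIRST = 0x4E00   # first code point of the region quoted above
HAN_LAST = 0x9FFF    # last code point of the region quoted above

# Number of code points available in the region (both ends inclusive).
HAN_CAPACITY = HAN_LAST - HAN_FIRST + 1

# Character count of Taiwan's CNS 11643 as quoted in the text.
CNS_11643_COUNT = 48711


def in_unified_han_region(ch):
    """Return True if the single character ch lies in 0x4e00 - 0x9fff."""
    return HAN_FIRST <= ord(ch) <= HAN_LAST


if __name__ == "__main__":
    print("capacity of the region:", HAN_CAPACITY)
    print("CNS 11643 alone fits:", CNS_11643_COUNT <= HAN_CAPACITY)
    print("U+9AA8 in region:", in_unified_han_region(chr(0x9AA8)))
```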

However, these characters are not exactly the same. If fonts for these characters are made from
the Chinese forms, Japanese people will regard them as wrong characters, though they may be able
to read them. The Unicode people think these unified characters are the same character with
different glyphs.

An example of Han Unification is available at U+9AA8
(http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9AA8). This is a Kanji character for
'bone'. U+8FCE (http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=8FCE) is another
example, a Kanji character for 'welcome'. The part running from the left side to the bottom is the
'run' radical. The 'run' radical is used in many Kanji, and all of them have the same problem.
U+76F4 (http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76F4) is yet another example,
a Kanji character for 'straight'. I, a native Japanese speaker, cannot recognize the Chinese
version at all.
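
The point that a unified character is a single code point regardless of language can be seen from a program as well. The sketch below assumes Python's standard unicodedata module; the describe helper and the EXAMPLES table are names I made up for this illustration. Nothing in the code point, its name, or its encoded bytes says whether a Japanese or a Chinese glyph is wanted; that decision is left entirely to the font.

```python
import unicodedata

# The three code points discussed above, with the meanings given in the text.
EXAMPLES = {0x9AA8: "bone", 0x8FCE: "welcome", 0x76F4: "straight"}


def describe(code_point):
    """Return the code point, its Unicode name, and its UTF-8 bytes in hex."""
    ch = chr(code_point)
    return "U+%04X %s %s" % (code_point, unicodedata.name(ch), ch.encode("utf-8").hex())


if __name__ == "__main__":
    for code_point, meaning in EXAMPLES.items():
        print(describe(code_point), "(" + meaning + ")")
```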

Unicode font vendors will hesitate over which glyphs to choose for these characters: the
simplified Chinese, traditional Chinese, Japanese, or Korean ones. One method is to supply four
fonts: a simplified Chinese version, a traditional Chinese version, a Japanese version, and a
Korean version. A commercial OS vendor can release localized versions of its OS. For example, the
Japanese version of MS Windows can include a Japanese version of the Unicode font (this is
exactly what they are doing). However, what should XFree86 or Debian do? I don't know...
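
The "four fonts" method could look roughly like the following sketch, which picks a font variant from a POSIX-style locale name. Everything here is an assumption for illustration: the font names are hypothetical, the function pick_han_font is my own, and falling back to the Japanese variant is only one possible policy, not an established practice of any system.

```python
# Hypothetical font names, one per glyph variant; real systems use other names.
HAN_FONT_VARIANTS = {
    "zh_CN": "ExampleHan-SimplifiedChinese",
    "zh_TW": "ExampleHan-TraditionalChinese",
    "ja": "ExampleHan-Japanese",
    "ko": "ExampleHan-Korean",
}


def pick_han_font(locale_name, default="ja"):
    """Choose a font variant for unified ideographs from a locale name.

    The default variant is an assumed policy for locales that give no hint.
    """
    # Drop the encoding and modifier parts: "ja_JP.UTF-8" -> "ja_JP".
    base = locale_name.split(".")[0].split("@")[0]
    if base in HAN_FONT_VARIANTS:          # exact match such as "zh_TW"
        return HAN_FONT_VARIANTS[base]
    language = base.split("_")[0]          # language only, such as "ja"
    if language in HAN_FONT_VARIANTS:
        return HAN_FONT_VARIANTS[language]
    return HAN_FONT_VARIANTS[default]


if __name__ == "__main__":
    for name in ("ja_JP.UTF-8", "zh_CN.UTF-8", "zh_TW", "ko_KR.eucKR", "en_US.UTF-8"):
        print(name, "->", pick_han_font(name))
```

This only moves the problem, of course: a system that is not localized still has to decide what the default is.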

[7] XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.

[8] I heard that Chinese and Korean people don't mind the glyphs of these characters. If this is always true, Japanese
glyphs should be the default glyphs for these problematic characters on international systems such as Debian.
