
Chapter 4. Coded Character Sets And Encodings in the World

Cross Mapping Tables

Unicode intends to be a superset of all major encodings in the world, such as ISO 8859-*, EUC-*,
KOI8-*, and so on. The aim is to preserve round-trip compatibility and to enable smooth
migration from other encodings to Unicode.

Merely being a superset is not sufficient. Reliable cross mapping tables between Unicode and
other encodings are also needed. They are provided by the Unicode Consortium
(http://www.unicode.org/Public/MAPPINGS/).

However, tables for East Asian encodings are no longer provided. They were provided once but are
now obsolete (http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/).

You may want to use these mapping tables even though they are obsolete, because no other
mapping tables are available. However, you will find a severe problem with these tables. There
are multiple different mapping tables for the Japanese encodings that include the JIS X 0208
character set. Thus, the same character in JIS X 0208 is mapped to different Unicode characters
depending on which mapping table is used. For example, Microsoft and Sun use different tables,
with the result that Java on MS Windows sometimes breaks Japanese characters.
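The effect of competing tables can be observed with a short Python sketch. It rests on an assumption: Python's built-in "shift_jis" and "cp932" codecs are used here as stand-ins for two differing vendor tables; they are not the Microsoft and Sun tables named above. The helper round_trips is a hypothetical name introduced only for this illustration.

```python
# Sketch: one Shift_JIS byte sequence decoded through two different tables.
# Assumption: Python's "shift_jis" and "cp932" codecs stand in for the
# competing vendor tables; they are not the exact tables discussed in the text.

WAVE_DASH_SJIS = b"\x81\x60"  # JIS X 0208 0x2141 in its Shift_JIS form

via_jis = WAVE_DASH_SJIS.decode("shift_jis")
via_ms = WAVE_DASH_SJIS.decode("cp932")

print(f"shift_jis -> U+{ord(via_jis):04X}")
print(f"cp932     -> U+{ord(via_ms):04X}")


def round_trips(text, codec):
    """Return True if text survives encode followed by decode with codec."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        return False


# Text decoded with one table may not be encodable with the other.
print(round_trips(via_ms, "cp932"))
print(round_trips(via_ms, "shift_jis"))
```

With CPython's bundled codecs, the two decodes yield U+301C and U+FF5E respectively, and the character produced by "cp932" cannot be encoded back with "shift_jis". This is the kind of round-trip breakage described above.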

Though we Open Source people should respect interoperability, we cannot achieve sufficient
interoperability because of this problem. All we can achieve is interoperability among Open
Source software.

GNU libc uses JIS/JIS0208.TXT
(http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT)
with a small modification. The modification is as follows:

  original JIS0208.TXT: 0x815F 0x2140 0x005C # REVERSE SOLIDUS

  modified: 0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS

The reason for this modification is that the JIS X 0208 character set is almost always used in
combination with ASCII, in the form of EUC-JP and so on. ASCII 0x5C, not JIS X 0208 0x2140,
should be mapped to U+005C. This modified table is found at /usr/share/i18n/charmaps/EUC-JP.gz
on a Debian system. Of course, this mapping table is neither authorized nor reliable.
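The modification can be pictured as an override applied while loading the table. The following Python sketch parses lines in the three-column layout of the two table lines quoted above (Shift_JIS, JIS X 0208, Unicode) and then replaces one entry. The function names parse_line and load_table are hypothetical, and this is not how GNU libc itself builds its charmap; it only illustrates the idea.

```python
# Sketch: parse JIS0208.TXT-style lines and apply a glibc-style override.
# Assumption: each data line holds three hex columns (Shift_JIS, JIS X 0208,
# Unicode) followed by an optional "#" comment, as in the sample lines.


def parse_line(line):
    """Return (sjis, jis, ucs) as integers, or None for comment/blank lines."""
    fields = line.split("#", 1)[0].split()
    if len(fields) != 3:
        return None
    sjis, jis, ucs = (int(field, 16) for field in fields)
    return sjis, jis, ucs


def load_table(lines, overrides=None):
    """Build a JIS X 0208 -> Unicode dict, then apply any overrides."""
    table = {}
    for line in lines:
        row = parse_line(line)
        if row is not None:
            _, jis, ucs = row
            table[jis] = ucs
    table.update(overrides or {})
    return table


original = ["0x815F\t0x2140\t0x005C\t# REVERSE SOLIDUS"]

plain = load_table(original)
patched = load_table(original, overrides={0x2140: 0xFF3C})

print(f"original: 0x2140 -> U+{plain[0x2140]:04X}")
print(f"modified: 0x2140 -> U+{patched[0x2140]:04X}")
```

Keeping the override separate from the parsed data leaves the upstream table untouched and makes the deviation from it explicit.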

I hope the Unicode Consortium will release an authorized, reliable, unique mapping table between
Unicode and JIS X 0208. You can read about this problem in detail
(http://www.debian.or.jp/~kubota/unicode-symbols.html).

Combining Characters

Unicode has a way to synthesize an accented character by combining an accent symbol with a base
character. For example, combining 'a' and '~' makes 'a' with tilde. Two or more accent symbols
can be added to a base character.
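As a minimal sketch of this mechanism, the Python snippet below uses the standard unicodedata module. Note one assumption about the example above: the combining accent is U+0303 COMBINING TILDE rather than the ASCII character '~' (U+007E), which is not a combining character.

```python
# Sketch: a base character followed by combining marks.
# Assumption: U+0303 COMBINING TILDE plays the role of '~' in the text;
# ASCII '~' (U+007E) is a spacing character and does not combine.
import unicodedata

BASE = "a"
COMBINING_TILDE = "\u0303"
COMBINING_ACUTE = "\u0301"

# One base character plus one accent: two code points, displayed as one glyph.
a_tilde = BASE + COMBINING_TILDE
print(len(a_tilde), [unicodedata.name(ch) for ch in a_tilde])

# NFC normalization folds the pair into the precomposed character U+00E3.
composed = unicodedata.normalize("NFC", a_tilde)
print(f"U+{ord(composed):04X}", unicodedata.name(composed))

# Several accents may follow the same base character.
stacked = BASE + COMBINING_TILDE + COMBINING_ACUTE
marks = [ch for ch in stacked if unicodedata.combining(ch) != 0]
print(len(stacked), len(marks))
```

unicodedata.combining returns a nonzero canonical combining class for combining marks and 0 for ordinary base characters, which is how the snippet counts the accents attached to 'a'.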
