Chapter 4. Coded Character Sets And Encodings in the World
Languages such as Thai need combining characters. Combining characters are the only method
to express characters in these languages.
However, a few problems arise.
Duplicate Encoding: There are multiple ways to express the same character. For example, u with
umlaut can be expressed as U+00FC and also as U+0075 + U+0308. How can we implement
'grep' and so on?
Open Repertoire: The number of expressible characters grows without limit. Characters that do
not exist in any language can be expressed.
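One common way to cope with duplicate encoding, which the text above does not itself prescribe, is
to normalize both strings to the same Unicode normalization form before comparing or searching.
The following is a minimal sketch using Python's standard unicodedata module; the helper names
same_text and contains are made up for this illustration.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC, u with umlaut as a single code point
decomposed = "u\u0308"   # U+0075 followed by U+0308 (combining diaeresis)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def contains(haystack: str, needle: str) -> bool:
    """A grep-like substring test that ignores the duplicate-encoding difference."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)


if __name__ == "__main__":
    print(precomposed == decomposed)            # raw code point comparison
    print(same_text(precomposed, decomposed))   # comparison after normalization
    print(contains("Mu\u0308nchen", "\u00fc"))  # search after normalization
```

Normalization only helps with sequences that have a defined composed form; it does nothing about
the open repertoire problem.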
Surrogate Pair
The first version of Unicode had only a 16-bit code space, though 16 bits is obviously insufficient to
contain all the characters in the world.[9] Thus surrogate pairs were introduced in Unicode 2.0 to
expand the number of characters while keeping compatibility with the former 16-bit Unicode.
However, surrogate pairs break the principle that all characters are expressed with the same number
of bits. This makes Unicode programming more difficult.
Fortunately, Debian and other UNIX-like systems will use UTF-8 (not UTF-16) as the usual encoding
for UCS. Thus, we don't need to handle UTF-16 and surrogate pairs very often.
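To make the loss of fixed width concrete, here is a small sketch of the UTF-16 surrogate pair
arithmetic in Python. The function names to_surrogate_pair and utf16_units are made up for this
illustration and are not part of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain after subtracting
    high = 0xD800 + (offset >> 10)     # upper 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits go into the low surrogate
    return high, low


def utf16_units(text: str) -> int:
    """Number of 16-bit units needed to store text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    print(utf16_units("A"), utf16_units(chr(0x1F600)))
```

A program that assumes one 16-bit unit per character will count, index, or split such text
incorrectly, which is the difficulty described above.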
ISO 646-* Problem
You will need a codeset converter between your local encodings (for example, ISO 8859-* or ISO
2022-*) and Unicode. For example, the Shift-JIS encoding[10] is built on JIS X 0201 Roman (the
Japanese version of ISO 646) rather than ASCII. JIS X 0201 Roman encodes the yen currency mark
at 0x5c, where ASCII encodes backslash.
Then which should your converter convert 0x5c in Shift-JIS into in Unicode: U+005C (backslash)
or U+00A5 (yen currency mark)? You may say the yen currency mark is the right solution. However,
backslash (and hence the yen mark) is widely used as an escape character. For example, 'new line' is
expressed as 'backslash n' in a C string literal, and Japanese people use 'yen currency mark n'.
You may say that program sources must be written in ASCII and that the mistake was trying
to convert program source in the first place. However, there is a great deal of source code and
other material written in Shift-JIS encoding.
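Since neither answer is right for every input, one option is to let the caller choose. The following
is a minimal sketch in Python under that assumption; the function decode_shift_jis and its policy
parameter are hypothetical, and the sketch only relies on Python's built-in shift_jis codec
mapping 0x5c to one of the two candidates.

```python
YEN = "\u00a5"        # U+00A5 YEN SIGN
BACKSLASH = "\u005c"  # U+005C REVERSE SOLIDUS (backslash)


def decode_shift_jis(data: bytes, policy: str = "backslash") -> str:
    """Decode Shift-JIS bytes, mapping byte 0x5c according to the chosen policy.

    policy="backslash" suits program source; policy="yen" suits ordinary text.
    """
    # Decode first, so that a 0x5c byte that is the second byte of a
    # double-byte character is never mistaken for a yen mark or backslash.
    text = data.decode("shift_jis")
    if policy == "backslash":
        return text.replace(YEN, BACKSLASH)
    if policy == "yen":
        return text.replace(BACKSLASH, YEN)
    raise ValueError("policy must be 'backslash' or 'yen'")


if __name__ == "__main__":
    source = b'printf("hi\x5cn");'   # a C statement as stored in Shift-JIS
    print(decode_shift_jis(source, "backslash"))
    print(decode_shift_jis(source, "yen"))
```

The converter itself cannot know whether its input is program source or a price list, so the
choice has to come from the user or from context.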
[9] There are a few projects, such as Mojikyo (http://www.mojikyo.gr.jp/) (about 90000 characters)
and the TRON project (http://www.tron.org/index-e.html) (about 130000 characters), that aim to
develop a CCS which contains sufficient characters for professional usage in the CJK world.
[10] The standard encoding for Macintosh and MS Windows.





