
Chapter 3. Important Concepts for Character Coding Systems

'fi'). In almost all cases, text data, which is intended to convey abstract ideas rather than
visual information, does not need to carry information about glyphs, since the difference
between glyphs does not affect the meaning of the text. However, the distinction between
different glyphs for a single CJK ideogram is sometimes important for proper nouns such as
names of persons and places. So far there is no standardized method for plain text to carry
information about glyphs. As a result, plain text cannot be used in some special fields such
as citizen registration systems and serious DTP such as newspaper systems.

Encoding
Encoding is a rule by which characters and texts are expressed as combinations of bits or
bytes so that computers can handle them. The terms character coding system, character code,
charset, and so on are used with the same meaning. Basically, an encoding deals with
characters, not glyphs. There are many official and de facto standards for encodings, such as
ASCII, ISO 8859-{1,2,...,15}, ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2},
EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, VISCII, VSCII, so-called
'CodePages', UTF-7, UTF-8, UTF-16LE, UTF-16BE, KOI8-R, and so on. To construct an encoding,
we have to consider the following concepts. (Encoding = one or more CCS + one CES).
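As a rough illustration of the idea that an encoding is a rule for turning characters into bytes, the following Python sketch encodes the same two characters with several of the encodings listed above. The helper name encode_all and the sample text are assumptions made for this example; only the codecs bundled with Python are used.

```python
# The same abstract characters become different byte sequences
# depending on which encoding is chosen.
text = "A\u00e9"  # LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE


def encode_all(s, encodings):
    """Return {encoding name: bytes or None} for each encoding name given.

    None means the encoding cannot represent at least one character of s.
    (Hypothetical helper, written only for this illustration.)
    """
    result = {}
    for name in encodings:
        try:
            result[name] = s.encode(name)
        except UnicodeEncodeError:
            # The character set behind this encoding lacks a character of s.
            result[name] = None
    return result


table = encode_all(text, ["ascii", "iso8859-1", "utf-8", "utf-16-le", "koi8-r"])
for name, data in table.items():
    print(name, data.hex() if data is not None else "not representable")
```

The encodings that report "not representable" are those whose character set does not contain every character of the sample text.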

Character Set
A character set is a set of characters. It determines the range of characters that the
encoding can handle. In contrast to a coded character set, it is often called a non-coded
character set.

Coded Character Set (CCS)
Coded character set (CCS) is a term defined in RFC 2050 (http://www.faqs.org/rfcs/rfc2050.html)
and means a character set in which every character is assigned a unique number by some
method. There are many national and international standards for CCS. Many national standards
for CCS adopt a way of coding that conforms to international standards such as ISO 646 or
ISO 2022. ASCII, BS 4730, JISX 0201 Roman, and so on are examples of ISO 646 variants. All
ISO 646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001, GB 2312, CNS 11643, CCCII,
TIS 620, TCVN 5712, and so on are examples of ISO 2022-compliant CCS. VISCII and Big5 are
examples of non-ISO 2022-compliant CCS. UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.
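To make the notion of "a unique number for every character" concrete, the sketch below looks up the number that two different CCS assign to the same character. The function names are hypothetical. The JISX 0208 lookup is an indirect trick: it assumes the EUC-JP convention of storing a JISX 0208 character as two bytes, each equal to the row or cell number plus 0xA0.

```python
def ucs_code_point(ch):
    """Number assigned to ch by the UCS / Unicode coded character set."""
    return ord(ch)


def jisx0208_row_cell(ch):
    """Row and cell (each 1..94) assigned to ch by JISX 0208.

    Recovered from the EUC-JP bytes, where each of the two bytes is the
    row or cell number plus 0xA0. Raises ValueError for characters that
    EUC-JP does not store as a plain two-byte JISX 0208 code.
    """
    data = ch.encode("euc_jp")
    if len(data) != 2 or data[0] < 0xA1 or data[1] < 0xA1:
        raise ValueError("not a JISX 0208 character")
    return data[0] - 0xA0, data[1] - 0xA0


hiragana_a = "\u3042"  # HIRAGANA LETTER A
print(hex(ucs_code_point(hiragana_a)))
print(jisx0208_row_cell(hiragana_a))
```

The same character receives unrelated numbers in the two coded character sets, which is why a CCS has to be named before a number can be interpreted as a character.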

Character Encoding Scheme (CES)
Character encoding scheme (CES) is also a term defined in RFC 2050
(http://www.faqs.org/rfcs/rfc2050.html); it refers to a method of constructing an encoding
from one or more CCS. This matters when two or more CCS are used to construct an encoding.
ISO 2022 is a method of constructing an encoding from one or more ISO 2022-compliant CCS.
ISO 2022 is a very complex system, so subsets of it are usually used, such as EUC-JP (ASCII
and JISX 0208), ISO 2022-KR (ASCII and KSX 1001), and so on. CES is not important for
encodings with only one 8-bit CCS. The UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can
be regarded as CES whose CCS is Unicode or ISO 10646.
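The role of a CES can be sketched by keeping the CCS fixed and varying only the scheme. In the Python example below (sample text and variable names are assumptions for illustration), EUC-JP and ISO 2022-JP both combine ASCII and JISX 0208 but lay the bytes out differently, and the three UTF forms serialize the same Unicode code points in three ways.

```python
sample = "A\u3042"  # 'A' from ASCII, HIRAGANA LETTER A from JISX 0208

# Two schemes over the same pair of CCS (ASCII and JISX 0208).
euc_jp = sample.encode("euc_jp")          # high bit set marks JISX 0208 bytes
iso2022_jp = sample.encode("iso2022_jp")  # escape sequences switch the CCS

# Three schemes over a single CCS (Unicode / ISO 10646).
utf8 = sample.encode("utf-8")
utf16le = sample.encode("utf-16-le")
utf16be = sample.encode("utf-16-be")

for label, data in [
    ("EUC-JP", euc_jp),
    ("ISO 2022-JP", iso2022_jp),
    ("UTF-8", utf8),
    ("UTF-16LE", utf16le),
    ("UTF-16BE", utf16be),
]:
    print(f"{label:12} {data.hex()}")
```

In the ISO 2022-JP output, the bytes 1b 24 42 and 1b 28 42 are the escape sequences that switch to JISX 0208 and back to ASCII, while EUC-JP needs no such switching because the two CCS occupy disjoint byte ranges.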

Some other terms related to character codes are also commonly used.

Character code is a widely used term meaning encoding. It is a primitive and crude term for
the way a computer handles characters by assigning numbers to them. For example, character
