
Chapter 6. LOCALE technology

6.3 Multibyte Characters and Wide Characters

Now we will concentrate on LC_CTYPE, which is the most important of the six locale categories.

Many encodings, such as ASCII, ISO-8859-*, KOI8-R, EUC-*, ISO-2022-*, TIS-620, UTF-8, and so on, are widely used around the world. It is inefficient and a cause of bugs, if not impossible, for every piece of software to implement all these encodings. Fortunately, we can use LOCALE technology to solve this problem.

Multibyte character is a term for characters encoded in a locale-specific encoding. It is nothing special; it is merely a name for the encodings we use every day. In an ISO-8859-1 locale, the multibyte characters are ISO-8859-1 characters. In an EUC-JP locale, they are EUC-JP characters. In a UTF-8 locale, they are UTF-8 characters. In short, the multibyte character is defined by the LC_CTYPE locale category. Multibyte characters are used whenever your software reads text data from, or writes it to, anything outside your software, for example standard input/output, the display, the keyboard, files, and so on, as you do every day.
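
To make the role of LC_CTYPE concrete, here is a minimal sketch, assuming a POSIX system (nl_langinfo() is POSIX, not ISO C). The helper name ctype_codeset is made up for this illustration. It selects the LC_CTYPE category and reports the name of the encoding that multibyte characters then use.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): select the LC_CTYPE
 * category named by `locale_name` and return the name of the resulting
 * multibyte encoding. Passing "" means "take the locale from the
 * environment". Returns NULL if the locale cannot be selected. */
const char *ctype_codeset(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;              /* locale not available on this system */
    return nl_langinfo(CODESET);  /* e.g. "UTF-8" in a UTF-8 locale */
}
```

A program would typically call ctype_codeset("") once at startup; the exact encoding names returned differ between systems.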

You can handle multibyte characters using the ordinary char or unsigned char types and the ordinary character- and string-oriented functions, just as you have always done for ASCII and 8-bit encodings. Then why do we give them the special name of multibyte character? The answer is that ISO C specifies a set of functions that can handle multibyte characters properly. On the other hand, it is obvious that usual C functions such as strlen(), which counts bytes rather than characters, cannot handle multibyte characters properly.

Then what are these functions that can handle multibyte characters properly? Please wait a minute. A multibyte character encoding may be stateful or stateless, and multibyte or non-multibyte, since the term covers every encoding that has ever been used or will be used on earth. Thus it is not convenient for internal processing. It needs a complex algorithm even for, for example, extracting a character from a string, concatenating or splitting strings, or counting the number of characters in a string. Thus, wide characters should be used for internal processing. And the main part of the C functions that can handle multibyte characters are functions for converting between multibyte characters and wide characters. These functions are introduced later. Note that you may be able to do without these functions, since ISO C supplies I/O functions that perform the conversion.

The wide character is defined in ISO C such

  that all characters are expressed in a fixed width of bits.

  that it is stateless, i.e., it doesn't have shift states.

Using UCS-4 is the second-best solution to this problem. Sometimes LOCALE technology cannot be used, and then UCS-4 is the best. I will discuss this solution later.

There are a few exceptions. Compound text should be used for communication between X clients. UTF-8 would be the standard for file names in Linux.
