Chapter 6. LOCALE technology
6.3 Multibyte Characters and Wide Characters
Now we will concentrate on LC_CTYPE, which is the most important of the six locale categories.
Many encodings such as ASCII, ISO-8859-*, KOI8-R, EUC-*, ISO-2022-*, TIS-620, UTF-8, and so on
are widely used around the world. It is inefficient and a source of bugs, even if not impossible,
for every piece of software to implement all of these encodings itself. Fortunately, we can use
LOCALE technology to solve this problem. [1]
Multibyte character is the term for a character encoded in the locale-specific encoding. There is
nothing special about it; it is merely a name for the encodings we use every day. In an ISO-8859-1
locale, ISO-8859-1 is the multibyte character encoding. In an EUC-JP locale, it is EUC-JP. In a
UTF-8 locale, it is UTF-8. In short, the multibyte character encoding is defined by the LC_CTYPE
locale category. Multibyte characters are used whenever your software inputs or outputs text data
from or to anywhere outside the software itself, for example standard input/output, the display,
the keyboard, files, and so on, just as you do every day. [2]
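As a minimal sketch of how LC_CTYPE selects the multibyte encoding, the helper below adopts the
user's LC_CTYPE setting and reports the resulting codeset. The name select_user_ctype is
hypothetical, and nl_langinfo() with CODESET is a POSIX facility rather than ISO C, so this
assumes a POSIX system.

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX, not ISO C) */
#include <locale.h>     /* setlocale, LC_CTYPE */
#include <stdlib.h>     /* MB_CUR_MAX */

/* Hypothetical helper: adopt the LC_CTYPE category from the environment
   and return the name of the encoding that multibyte characters now use. */
const char *select_user_ctype(void)
{
    /* "" means: take the setting from LC_ALL / LC_CTYPE / LANG.
       If the requested locale is unsupported, setlocale returns NULL
       and the previous locale (initially "C") stays in effect. */
    (void)setlocale(LC_CTYPE, "");

    /* The codeset name depends on the environment and the C library. */
    return nl_langinfo(CODESET);
}
```

After the call, MB_CUR_MAX gives the maximum number of bytes one multibyte character can occupy
in the selected locale, which is a quick way to see whether the encoding is single-byte.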
You can handle multibyte characters using the ordinary char or unsigned char types and the ordinary
character- and string-oriented functions, just as you used to do for ASCII and 8-bit encodings.
Then why do we give it the special name multibyte character? The answer is that ISO C specifies a
set of functions that can handle multibyte characters properly. On the other hand, usual C functions
such as strlen(), which counts bytes rather than characters, obviously cannot handle multibyte
characters properly.
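To illustrate the difference, here is a small sketch that counts characters rather than bytes by
stepping through the string with the ISO C function mbrlen(). The function name mb_count_chars is
hypothetical, and the result depends on the LC_CTYPE locale in effect when it is called.

```c
#include <stddef.h>
#include <string.h>   /* strlen, memset */
#include <wchar.h>    /* mbrlen, mbstate_t */

/* Hypothetical helper: count the characters (not bytes) of a multibyte
   string in the current LC_CTYPE locale.
   Returns (size_t)-1 if the string holds an invalid or truncated sequence. */
size_t mb_count_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);   /* bytes left, excluding the final NUL */
    size_t count = 0;

    memset(&state, 0, sizeof state);   /* begin in the initial shift state */

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete character */
        if (n == 0)
            break;                     /* defensive: NUL reached early */
        s += n;                        /* skip the bytes of one character */
        remaining -= n;
        count++;
    }
    return count;
}
```

For pure ASCII text the result equals strlen(). In a locale whose encoding uses several bytes per
character, strlen() reports the larger byte count while this function reports characters.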
Then what are these functions that can handle multibyte characters properly? Please wait a
minute. A multibyte encoding may be stateful or stateless, and multibyte or non-multibyte, since
the term covers every encoding that has been or will be used on the earth. Thus it is not
convenient for internal processing. Even simple operations, such as extracting a character from
a string, concatenating or splitting strings, or counting the characters in a string, need
complex algorithms. Thus, wide characters should be used for internal processing. The main part
of the C functions that handle multibyte characters properly consists of functions for converting
between multibyte characters and wide characters. These functions are introduced later. Note that
you may be able to do without them, since ISO C supplies I/O functions that perform the conversion.
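The following sketch shows the pattern described above: convert the multibyte input to wide
characters, do the internal processing there, and convert back for output. It uses the ISO C
functions mbsrtowcs() and wcsrtombs(); the function name mb_prefix and the chosen operation
(keeping the first few characters) are only illustrative assumptions.

```c
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbsrtowcs, wcsrtombs, mbstate_t */

/* Hypothetical helper: return a newly allocated multibyte string that holds
   the first max_chars characters of s, using the current LC_CTYPE locale.
   Returns NULL on conversion or allocation failure. The caller frees it. */
char *mb_prefix(const char *s, size_t max_chars)
{
    mbstate_t st;
    const char *src = s;
    const wchar_t *wsrc;
    wchar_t *w;
    char *out = NULL;
    size_t wlen, blen;

    /* 1. Multibyte -> wide. A NULL destination only counts characters. */
    memset(&st, 0, sizeof st);
    wlen = mbsrtowcs(NULL, &src, 0, &st);
    if (wlen == (size_t)-1)
        return NULL;                    /* invalid multibyte sequence */
    w = malloc((wlen + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    src = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &src, wlen + 1, &st);

    /* 2. Internal processing: one array element per character. */
    if (max_chars < wlen)
        w[max_chars] = L'\0';

    /* 3. Wide -> multibyte for output. */
    wsrc = w;
    memset(&st, 0, sizeof st);
    blen = wcsrtombs(NULL, &wsrc, 0, &st);
    if (blen != (size_t)-1 && (out = malloc(blen + 1)) != NULL) {
        wsrc = w;
        memset(&st, 0, sizeof st);
        wcsrtombs(out, &wsrc, blen + 1, &st);
    }
    free(w);
    return out;
}
```

Truncating the char array directly could cut a character in half or leave a stateful encoding
in the wrong shift state, which is why the cut is made on the wide-character side.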
Wide characters are defined in ISO C such that
  - all characters are expressed in a fixed width of bits.
  - they are stateless, i.e., they have no shift states.
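Because of these two properties, a wide string can be treated as a plain array with one element
per character. As a small sketch, assuming each wchar_t element holds one whole character as
stated above, reversing a string needs no knowledge of the encoding. The function name
wstr_reverse is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>   /* wcslen */

/* Hypothetical helper: reverse a wide string in place.
   Each element is one character and there is no shift state, so a plain
   element swap is enough. */
void wstr_reverse(wchar_t *w)
{
    size_t len = wcslen(w);
    size_t i, j;

    if (len < 2)
        return;                         /* nothing to swap */
    for (i = 0, j = len - 1; i < j; i++, j--) {
        wchar_t tmp = w[i];
        w[i] = w[j];
        w[j] = tmp;
    }
}
```

The same byte-wise swap on a multibyte char string would scramble any character that occupies
more than one byte.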
[1] The use of UCS-4 is the second-best solution to this problem. Sometimes LOCALE technology
cannot be used and UCS-4 is the best choice. I will discuss this solution later.
[2] There are a few exceptions. Compound text should be used for communication between X clients.
UTF-8 would be the standard for file names in Linux.





