Chapter 4. Coded Character Sets And Encodings in the World
UTF-16
UTF-16 is an encoding whose CCS is 20-bit Unicode.
Characters in the BMP are expressed using the 16-bit value of their code point in the Unicode CCS.
There are two ways to express a 16-bit value in an 8-bit stream. Some of you may have heard the word
endian. Big endian means that the octets of a multi-octet datum are arranged from the most
significant octet to the least significant one. Little endian is the opposite. For example, the
16-bit value 0x1234 is expressed as 0x12 0x34 in big endian and as 0x34 0x12 in little endian.
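The two arrangements can be checked with a minimal Python sketch. It uses only the built-in int.to_bytes method; the variable names are chosen here for illustration and do not come from the text.

```python
# The 16-bit value used as the example in the text.
value = 0x1234

# Split the value into two octets in each byte order.
big = value.to_bytes(2, byteorder="big")        # most significant octet first
little = value.to_bytes(2, byteorder="little")  # least significant octet first

# Print each octet in 0x.. notation.
print(" ".join(f"0x{b:02x}" for b in big))
print(" ".join(f"0x{b:02x}" for b in little))
```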
UTF-16 supports both endians. Thus, the Unicode character U+1234 can be expressed either as
0x12 0x34 or as 0x34 0x12. In return, UTF-16 texts have to have a BOM (Byte Order Mark) at their
beginning. The Unicode character U+FEFF zero width no-break space is called a BOM when it is
used to indicate the byte order, or endian, of a text. The mechanism is easy: in big endian, U+FEFF
will be 0xfe 0xff, while it will be 0xff 0xfe in little endian. Thus you can determine the
endian of a text by reading its first two bytes. [6]
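Reading the first two bytes can be sketched as follows. This is a minimal illustration, not a complete decoder: the function name detect_utf16_byte_order and the sample byte strings are hypothetical, and input without a BOM is simply reported as unknown.

```python
def detect_utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its first two bytes."""
    if data[:2] == b"\xfe\xff":
        return "big"      # U+FEFF stored as 0xfe 0xff
    if data[:2] == b"\xff\xfe":
        return "little"   # U+FEFF stored as 0xff 0xfe
    return "unknown"      # no BOM at the beginning


# U+1234 preceded by a BOM, in each byte order.
sample_be = b"\xfe\xff\x12\x34"
sample_le = b"\xff\xfe\x34\x12"

print(detect_utf16_byte_order(sample_be))
print(detect_utf16_byte_order(sample_le))
```

Python's own "utf-16" codec does the same check when decoding, so both samples decode to the single character U+1234.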
Characters not included in the BMP are expressed using a surrogate pair. The code points U+D800
to U+DFFF are reserved for this purpose. First, 0x10000 is subtracted from the code point, and the
resulting 20 bits are divided into two sets of 10 bits. The more significant 10 bits are mapped to
the 10-bit space of U+D800 to U+DBFF. The less significant 10 bits are mapped to the 10-bit space
of U+DC00 to U+DFFF. Thus UTF-16 can express 20-bit Unicode characters.
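The mapping can be sketched in a few lines of Python. The function name to_surrogate_pair is hypothetical. Note that the sketch first subtracts 0x10000 from the code point, as the UTF-16 definition does, so that the remaining value fits in 20 bits.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (offset >> 10)   # more significant 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # less significant 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+{high:04X} U+{low:04X}")
```

As a cross-check, the pair computed for U+1F600 matches the bytes that Python's built-in "utf-16-be" codec produces for the same character.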
UTF-16BE and UTF-16LE
UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to big and little endians,
respectively.
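As an illustration, Python's standard codecs expose these variants under the names "utf-16-be" and "utf-16-le". The sketch below only shows how those codecs behave: the fixed-endian codecs write no BOM, while the plain "utf-16" codec writes one in front of the text.

```python
text = "\u1234"

be = text.encode("utf-16-be")    # always big endian, no BOM
le = text.encode("utf-16-le")    # always little endian, no BOM
with_bom = text.encode("utf-16")  # BOM first, then the platform's byte order

print(be.hex(), le.hex(), with_bom.hex())
```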
UTF-7
UTF-7 is designed so that Unicode can be communicated over a 7-bit communication path.
***** Not written yet *****
UCS-2 and UCS-4 as encodings
Though I introduced UCS-2 and UCS-4 as CCSs, they can also be encodings.
In the UCS-2 encoding, each UCS-2 character is expressed in two bytes. In the UCS-4 encoding, each
UCS-4 character is expressed in four bytes.
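A fixed-width encoding of this kind can be sketched as below. The function name encode_fixed_width is hypothetical, and big endian byte order is an assumption made for the sketch, since the text does not specify one.

```python
def encode_fixed_width(text, width):
    """Encode each character as `width` bytes (2 for UCS-2, 4 for UCS-4).

    Big endian order is an assumption of this sketch. A character whose
    code point does not fit in `width` bytes raises OverflowError.
    """
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)


ucs2 = encode_fixed_width("A\u1234", 2)
ucs4 = encode_fixed_width("A\u1234", 4)

print(ucs2.hex())
print(ucs4.hex())
```

Because every character takes the same number of bytes, the n-th character always starts at byte offset n times the width.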
[6] I heard that the BOM is merely a suggestion by a vendor. Read Markus Kuhn's UTF-8 and Unicode
FAQ for Unix/Linux (http://www.cl.cam.ac.uk/~mgk25/unicode.html) for details.





