Schaake.nu

Encoding

Posted in /Software on 29 June 2012
This blog was written by Christiaan Schaake

UTF Encoding

Unicode Transformation Format (UTF) is a way of encoding the large Unicode character set (including all Western, Arabic, Chinese etc. characters) in a uniform way. The characters can be encoded using a variable number of bytes (UTF-8, 1 to 4 bytes), 2-byte units (UTF-16) or 4 bytes (UTF-32). Officially the encoding name is case-insensitive and spelled as UTF, a dash, then 8, 16 or 32, optionally followed by BE or LE for big endian or little endian (e.g. UTF-16LE or utf-8).
There are 3 encoding settings which must match for an XML file:

  • Byte Order Mark
  • XML encoding
  • Character Encoding
The file encoding tells which encoding is used in the file; this is signalled by the BOM (Byte Order Mark). The XML encoding tells which encoding is used in the XML document, and it must match the BOM. For some encodings the BOM is not required. The character encoding of the actual bytes must match the encoding announced by the BOM or by the XML encoding.

Byte Order Mark

The Byte Order Mark (BOM) specifies the type of encoding used in a file. The BOM occupies the first bytes of a file: the first 2 bytes for UTF-16, the first 3 bytes for UTF-8 and the first 4 bytes for UTF-32.
The character used for the BOM is the Zero-Width No-Break Space character (U+FEFF), so it is not displayed in editors. Normally a BOM would not show up in the middle of a file, but if this happens for some reason the character will still not be displayed.

Byte order mark
Bytes          Encoding form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

In some situations the BOM is not required. The BOM for UTF-8 is optional and if the XML encoding is set to UTF-16LE or UTF-16BE the BOM is also not set.
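The BOM table can be turned into a small detection routine. The sketch below is only an illustration: the function name detect_bom is made up for this example, and it relies on the BOM constants from Python's standard codecs module. A file without a BOM simply yields None, which matches the cases where the BOM is optional or not set.

```python
import codecs

# Longest marks first: the UTF-32LE mark (FF FE 00 00) begins with the
# UTF-16LE mark (FF FE), so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data: bytes):
    """Return the encoding form named by a leading BOM, or None if absent.

    detect_bom is a hypothetical helper name, not part of any library.
    """
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```

Calling detect_bom on the first 4 bytes of a file is enough, since no BOM is longer than that.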

XML Encoding

The encoding is specified in the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>. The encoding must match the BOM: if the BOM indicates a UTF-16 file, the encoding must also be set to UTF-16. Browsers are known to give an error message when the BOM and the encoding do not match.

When the encoding is set to UTF-16LE or UTF-16BE the BOM is not set.
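Python's built-in codecs happen to follow the same rule, which makes it easy to see the difference. The sketch below is an illustration only; encode_xml is a hypothetical helper that writes the declaration and encodes the document with the same encoding name, so the two cannot drift apart.

```python
import codecs


def encode_xml(body: str, encoding: str) -> bytes:
    """Prefix an XML declaration naming `encoding`, then encode with it.

    encode_xml is a made-up helper for this example.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + body).encode(encoding)


# Generic name: the codec prepends a BOM (in the machine's native byte order).
generic = encode_xml("<a/>", "UTF-16")

# Byte order is part of the name: the codec writes no BOM.
explicit = encode_xml("<a/>", "UTF-16LE")

print(generic[:2].hex())   # fffe on little-endian machines, feff on big-endian
print(explicit[:2].hex())  # 3c00, the "<" character itself
```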

Character encoding

Unicode defines 3 different encoding forms and 2 different byte orders. The 3 encoding forms are UTF-8, UTF-16 and UTF-32. Since UTF-16 and UTF-32 are built from units of 2 or 4 bytes, the byte order must also be specified. A unit can start with its least significant byte or its most significant byte; these are called the little endian and big endian byte orders.
The table below shows the code of characters in the different encodings.

Character encoding
Character            UTF-8       UTF-16LE   UTF-16BE   UTF-32LE       UTF-32BE
z                    7A          7A 00      00 7A      7A 00 00 00    00 00 00 7A
é                    C3 A9       E9 00      00 E9      E9 00 00 00    00 00 00 E9
水 (Chinese water)   E6 B0 B4    34 6C      6C 34      34 6C 00 00    00 00 6C 34
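The rows of this table can be reproduced with a few lines of Python, which is a handy way to check other characters. This is a minimal sketch; hex_bytes is a helper name invented for the example, and the codec names are the ones Python's standard library uses for these encodings.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def hex_bytes(char: str, encoding: str) -> str:
    """Encode one character and format the bytes as uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in char.encode(encoding))


# z, é (U+00E9) and 水 (U+6C34), written as escapes so the source stays ASCII.
for char in "z\u00e9\u6c34":
    print(char, *(hex_bytes(char, enc) for enc in ENCODINGS), sep=" | ")
```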

Be aware that most Windows editors (e.g. Notepad, UltraEdit) will convert a file to UTF-16 or ASCII when opening it. If you want to see the exact bytes, use a hex editor.

UTF-8

Characters in UTF-8 are represented by 1, 2 or 3 bytes (actually we can go up to 4 bytes, but this is not supported by all applications). The first 128 characters are the same as the 128 characters of the US-ASCII character set, so the letter z is represented as 7A. Characters from 128 upwards use 2, 3 or 4 bytes. E.g. the é character is character 233 (E9) in ISO 8859-1 and Unicode, so it is above 127 and is represented by the 2-byte code C3 A9.
The exact layout of UTF-8 is displayed in the table below:

UTF-8
Byte range   Description
0 - 31       Single byte, ASCII control characters
32 - 127     Single byte, ASCII printable characters
128 - 191    Second and later bytes of a multi-byte character
192 - 223    First byte of a 2-byte character
224 - 239    First byte of a 3-byte character
240 - 247    First byte of a 4-byte character

The byte ranges in UTF-8 do not overlap: a byte that starts a character can never be mistaken for a byte that continues one. This limits the number of unique characters, but is still sufficient to encode all Unicode characters in UTF-8.
The bytes 0xFF and 0xFE are never used in UTF-8, so they cannot be confused with the UTF-16 BOM.
In fact the bytes 0xC0, 0xC1 and the range 0xF5 to 0xFF are not valid in UTF-8.
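Because the byte ranges do not overlap, the first byte alone tells how long a character is. The sketch below illustrates this under the ranges of the UTF-8 standard (first bytes C2-DF, E0-EF and F0-F4 for 2, 3 and 4 byte characters); sequence_length and is_valid_utf8 are names invented for this example, and the second function simply leans on Python's own strict UTF-8 decoder.

```python
def sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence that starts with byte `lead`.

    Returns 0 for bytes that cannot start a sequence.
    """
    if lead <= 0x7F:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF are never valid.
    return 0


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict decoder accepts the bytes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, sequence_length(0xC3) gives 2 for the first byte of é, while is_valid_utf8(b"\xff\xfe") is False because those two bytes never occur in UTF-8.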

UTF-16

Characters in UTF-16 are represented by 2 bytes (16 bits); characters above U+FFFF need two such units, so 4 bytes. The first 256 characters are the same as the ISO 8859-1 character set, so the letter z is represented as 00 7A in UTF-16BE. The first 128 characters of ISO 8859-1 (and UTF-16) match the US-ASCII character set as well.
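Both properties are easy to check in Python. The sketch below is an illustration only: it compares UTF-16BE against ISO 8859-1 (codec name latin-1 in Python) for the first 256 characters, and then encodes one character outside the 16-bit range, U+1F600, which is an example chosen here and not taken from the tables above. Such characters take 4 bytes in UTF-16.

```python
# For code points 0-255, UTF-16BE is a zero byte followed by the ISO 8859-1 byte.
latin1_matches = all(
    chr(cp).encode("utf-16-be") == b"\x00" + chr(cp).encode("latin-1")
    for cp in range(256)
)
print(latin1_matches)  # True

# The letter z: 00 7A in UTF-16BE.
z_bytes = "z".encode("utf-16-be")

# A code point above U+FFFF does not fit in 16 bits and is stored as two
# 16-bit units (4 bytes in total).
big = "\U0001F600".encode("utf-16-be")
print(len(big))  # 4
```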

ASCII

ASCII (American Standard Code for Information Interchange) is the character-set underlying the encodings used for modern English and other Western European languages. The preferred encoding name is US-ASCII.
ASCII is basically a 7 bit character-set made up of 128 characters. The first 32 characters (0-31) are control characters. These characters were used to operate early printers and include the carriage return (13) and the line feed (10).
Characters 32 to 127 are printable characters (actually 127 is the delete character, DEL, and not really a printable character). An uppercase letter can be calculated by subtracting 32 from the lowercase letter value, or by setting the sixth bit in the byte (value 32) to 0.
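The relation between uppercase and lowercase letters can be shown in a few lines. This is a minimal sketch for ASCII letters only; to_upper_ascii and to_lower_ascii are helper names invented for the example. In ASCII, a lowercase letter is its uppercase letter plus 32, and 32 is exactly the sixth bit (0x20).

```python
CASE_BIT = 0x20  # the sixth bit of the byte, value 32


def to_upper_ascii(char: str) -> str:
    """Clear the case bit of a lowercase ASCII letter (same as subtracting 32)."""
    code = ord(char)
    if ord("a") <= code <= ord("z"):
        return chr(code & ~CASE_BIT)
    return char  # anything else is left untouched


def to_lower_ascii(char: str) -> str:
    """Set the case bit of an uppercase ASCII letter (same as adding 32)."""
    code = ord(char)
    if ord("A") <= code <= ord("Z"):
        return chr(code | CASE_BIT)
    return char


print(ord("a") - ord("A"))   # 32
print(to_upper_ascii("z"))   # Z
print(to_lower_ascii("Z"))   # z
```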

Extended ASCII

Extended ASCII is an extension of US-ASCII. Extended ASCII uses the 8th bit and describes the characters 128-255. These characters include the language-specific characters. Since 128 extra characters are not enough to cover all language-specific characters, a lot of Extended ASCII variants are available. These variants are defined by the code-page.

Codepage

The code-page defines the upper part of the Extended ASCII character-set (characters 128-255). The code-page for American English is code-page 437, which includes special characters for the American market. Greek uses code-page 737 to be able to display the Greek characters.
The lower part of the Extended ASCII character-set (characters 0-127) is the same in all code-pages.
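The effect of the code-page is easy to see by decoding the same byte twice. The sketch below uses the cp437 and cp737 codecs that ship with Python; the byte 0x82 is just an example value picked for this illustration.

```python
upper_byte = bytes([0x82])

# Same byte, different character depending on the code-page.
in_cp437 = upper_byte.decode("cp437")  # é
in_cp737 = upper_byte.decode("cp737")  # Γ (Greek capital gamma)
print(in_cp437, in_cp737)

# The lower part (0-127) decodes identically in both code-pages.
lower_half = bytes(range(128))
same_lower_half = lower_half.decode("cp437") == lower_half.decode("cp737")
print(same_lower_half)  # True
```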

This blog is tagged as Encoding, Programming
