UTF-8

Example - Low code point - £

For multibyte characters, when viewed in binary, each UTF-8 byte starts with a run of 1s followed by a single 0. These prefixes are stripped off and the remaining bits are concatenated to build the Unicode character number (code point). The number of leading 1 bits in the first byte is equal to the number of bytes used for this character; all following bytes within the character's encoding start with "10".

Example: the UK Pound symbol "£"

This character has the Unicode code point (hex) A3: 1010 0011

Its 8 significant bits do not fit in a single-byte (7-bit) encoding, so this character needs two UTF-8 bytes, which will look like: 110x xxxx 10xx xxxx

Slotting in the code point and padding with 0s: 1100 0010 1010 0011

Which in hexadecimal is: C2 A3
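
The two-byte case above can be sketched with a couple of shifts and masks. This is only an illustration of the template, not a full encoder; the function name encode_two_byte is made up for this note, and the range check (hex 80 to 7FF) comes from the 11 payload bits the two-byte template offers.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point that fits the template 110x xxxx 10xx xxxx."""
    # Assumption: only the two-byte range U+0080..U+07FF is handled here.
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte template")
    first = 0b11000000 | (code_point >> 6)     # 110x xxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0x3F)  # 10xx xxxx: lower 6 bits
    return bytes([first, second])


print(encode_two_byte(0xA3).hex())  # c2a3
```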

Example - High code point - ぐ

The Hiragana character "ぐ" has code point (hex) 30 50: 0011 0000 0101 0000

With 14 significant bits to encode (more than the 11 a two-byte sequence can hold), this requires three bytes: 1110 xxxx 10xx xxxx 10xx xxxx

Slotting in the code point bits: 1110 0011 1000 0001 1001 0000

Which in hexadecimal is: e3 81 90
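
Generalising the two examples, a rough sketch of an encoder for any sequence length might look like the following. The function name utf8_encode is hypothetical, and the range boundaries (hex 80, 800, 10000, 10FFFF) are taken from the UTF-8 definition rather than from the notes above. In real code the built-in str.encode("utf-8") is the thing to use; this version also skips the check for surrogate code points that the built-in performs.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the prefix templates described above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        return bytes([code_point])        # single byte, no prefix needed
    if code_point < 0x800:
        length = 2
    elif code_point < 0x10000:
        length = 3
    else:
        length = 4

    # Build the continuation bytes (10xx xxxx) from the low bits upwards.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # Lead byte: `length` leading 1 bits, a 0, then the remaining bits.
    prefix = (0xFF << (8 - length)) & 0xFF
    return bytes([prefix | code_point] + tail[::-1])


print(utf8_encode(0x3050).hex())  # e38190
```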

Finding the code point for a UTF-8 encoding

This is easier than encoding - the first byte tells you how many bytes are included in this character. Grab those bytes, strip the prefix (the run of 1s and the 0 that follows it) from each one, and concatenate the remaining bits to get the code point:

Using the Hiragana example - step backwards through the sequence:

  1. Hex: e3 81 90 / Binary: 1110 0011 1000 0001 1001 0000

  2. Three leading 1 bits, so use the template: 1110 xxxx 10xx xxxx 10xx xxxx

  3. Remaining bits: 0011 0000 0101 0000

  4. Code point: 30 50
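
The same steps can be sketched in code. The helper name utf8_decode_first is invented for this note; it decodes only the first character of a byte string and returns the code point together with the number of bytes consumed. It is a minimal sketch: it checks the prefixes but does not reject overlong encodings or surrogates, which a strict decoder such as bytes.decode("utf-8") would.

```python
def utf8_decode_first(data: bytes):
    """Return (code_point, bytes_consumed) for the first character in data."""
    if not data:
        raise ValueError("no data")
    lead = data[0]
    if lead < 0x80:
        return lead, 1                    # single-byte character

    # Count the leading 1 bits of the first byte.
    length = 0
    while lead & (0x80 >> length):
        length += 1
    if length == 1 or length > 4:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")

    # Strip the prefix from the lead byte, then append 6 bits per follower.
    code_point = lead & (0x7F >> length)
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("continuation byte must start with 10")
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point, length


cp, used = utf8_decode_first(bytes.fromhex("e38190"))
print(hex(cp), used)  # 0x3050 3
```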

Detecting UTF-8

This is non-trivial. For one thing, there are multiple 8-bit encodings that use the available byte values in different ways. Using Latin-1 (ISO-8859-1) as an example (in practice you would want to be sure that you were receiving either Latin-1 or UTF-8):

"C2 A3" is a valid ISO-8859-1 string (both bytes are legal); however, the resulting character sequence "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" "POUND SIGN" (Â£) is unlikely to be what was intended.

Because of this, some systems insert a BOM (byte order mark, the zero-width non-breaking space character; see http://en.wikipedia.org/wiki/Byte_Order_Mark) at the start of the data (EF BB BF in UTF-8), which can usually be used as a reliable indicator that the data is UTF-8 (even though it is also a valid Latin-1 string).
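
One possible heuristic, assuming (as above) that the input is known to be either Latin-1 or UTF-8: look for the BOM first, then try a strict UTF-8 decode and fall back to Latin-1 if it fails. The function name guess_encoding is made up for this note, and the result is only a guess; a Latin-1 string such as "C2 A3" that happens to be valid UTF-8 will be reported as UTF-8.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Latin-1; returns a Python codec name."""
    # A leading EF BB BF is treated as a UTF-8 byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec strips the BOM when decoding
    try:
        data.decode("utf-8")  # strict: raises on any malformed sequence
    except UnicodeDecodeError:
        return "latin-1"      # every byte value is legal in Latin-1
    return "utf-8"


print(guess_encoding(bytes.fromhex("c2a3")))  # utf-8
print(guess_encoding(bytes.fromhex("a3")))    # latin-1
```

Pure ASCII input is reported as UTF-8 here, which is harmless because the two encodings agree on those bytes.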

BradsWiki: Programming Notes/Utf-8 (last edited 2011-12-01 01:20:29 by BradleyDean)