UTF-8
The standard: http://www.unicode.org/standard/standard.html
The RFC: http://tools.ietf.org/html/rfc3629
Wikipedia description: http://en.wikipedia.org/wiki/UTF-8
Example - Low code point - £
For multibyte characters, when considered in binary, UTF-8 bytes start with a run of 1 bits followed by a single 0. These prefixes are stripped off and the remaining bits concatenated to build the Unicode character number (code point). The number of leading 1 bits in the first byte is equal to the number of bytes consumed for this character; all following bytes within the character's encoding start with "10". (Single-byte characters - plain ASCII, 0x00 through 0x7F - start with a 0 bit and represent themselves directly.)
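For reference, the four byte templates defined in RFC 3629 are:

1 byte:  0xxx xxxx                                (code points 00 - 7F)
2 bytes: 110x xxxx 10xx xxxx                      (code points 80 - 7FF)
3 bytes: 1110 xxxx 10xx xxxx 10xx xxxx            (code points 800 - FFFF)
4 bytes: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx  (code points 10000 - 10FFFF)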
Example: the UK Pound symbol "£"
This character has the Unicode code point of (hex) A3: 1010 0011
This character needs two UTF-8 bytes (its code point is above 0x7F, but its 8 bits fit in the 11 payload bits of the two-byte form) - so the bytes will look like: 110x xxxx 10xx xxxx
Slotting in the code point and padding with 0s: 1100 0010 1010 0011
Which in hexadecimal is: C2 A3
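A minimal Python sketch of this by-hand encoding (encode_utf8 is just an illustrative name - Python's built-in str.encode('utf-8') does the real job):

def encode_utf8(cp):
    """Encode a single Unicode code point into UTF-8 bytes by hand."""
    if cp < 0x80:                       # 1 byte: plain ASCII
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert encode_utf8(0xA3) == '£'.encode('utf-8') == b'\xc2\xa3'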
Example - High code point - ぐ
The Hiragana character with code point (hex) 30 50: 0011 0000 0101 0000
With 14 significant bits to encode - more than the 11 payload bits two bytes can carry - this will require three bytes: 1110 xxxx 10xx xxxx 10xx xxxx
Slotting in the code point bits: 1110 0011 1000 0001 1001 0000
Which in hexadecimal is: e3 81 90
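The same check for the three-byte case, reusing the encode_utf8 sketch above:

assert encode_utf8(0x3050) == 'ぐ'.encode('utf-8') == b'\xe3\x81\x90'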
Finding the code point for a UTF-8 encoding
This is easier than encoding - the first byte tells you how many bytes make up this character. Grab those bytes, strip the leading prefix bits (the 1...10 run on the first byte, the 10 on each continuation byte) and concatenate the rest for the code point:
Using the Hiragana example - working back through the encoding steps:
Hex: e3 81 90 / Binary: 1110 0011 1000 0001 1001 0000
Three leading 1 bits, so this is a three-byte character using template: 1110 xxxx 10xx xxxx 10xx xxxx
Remaining bits: 0011 0000 0101 0000
Code point: 30 50
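A matching Python sketch for the decode direction (decode_first_utf8 is a hypothetical helper that reads one character from the front of a byte string and returns the code point plus the number of bytes consumed):

def decode_first_utf8(data):
    b0 = data[0]
    if b0 < 0x80:               # single byte - plain ASCII
        return b0, 1
    n = 0                       # count the leading 1 bits of the first byte
    while (b0 << n) & 0x80:
        n += 1
    cp = b0 & (0x7F >> n)       # bits remaining after the 1...10 prefix
    for b in data[1:n]:         # each continuation byte contributes 6 bits
        cp = (cp << 6) | (b & 0x3F)
    return cp, n

assert decode_first_utf8(b'\xe3\x81\x90') == (0x3050, 3)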
Detecting UTF-8
This is non-trivial. For one thing, there are multiple 8-bit encodings that assign the available byte values in different ways. Using Latin1/iso-8859-1 as an example (but you would want to be sure that you were receiving either latin1 or utf-8):
- searching for bytes that Latin1 doesn't use for printable characters (ie 0x80 through 0x9f - the control block between ASCII and the high-byte characters) is unreliable, because utf-8 data can miss this block by chance (just by ending up with characters which don't encode with those bytes)
- if high-byte characters are present (ie anything not in ASCII - 0x00-0x7f), try to decode as utf-8 and check for errors (see the sketch at the end of this section). This usually works because, in general, iso-8859-1 encoded data will not by chance contain the right sequence of bytes to be valid utf-8.
This is not fool-proof but probably more reliable - for example the GBP sign encodes as:
- utf-8: C2 A3
- iso-8859-1: A3
"C2 A3" is a valid iso-8859-1 string (bytes are legal) however the character sequence "LATIN CAPITAL LETTER A WITH CIRCUMFLEX" "POUND SIGN" is unlikely
Because of this some systems insert a BOM (zero-width non-breaking space) http://en.wikipedia.org/wiki/Byte_Order_Mark at the start of the data (EF BB BF in utf-8) - which can usually be used as a reliable indicator of utf-8 (even though it's also a valid latin1 string).
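Putting the BOM check and the decode test together - a rough sketch, assuming the data really is one of these two encodings (guess_encoding is a hypothetical helper):

def guess_encoding(data):
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'              # BOM present - near-certain utf-8
    try:
        data.decode('utf-8')        # strict decode; raises on bad sequences
        return 'utf-8'              # pure ASCII also lands here (identical in both)
    except UnicodeDecodeError:
        return 'iso-8859-1'         # high bytes that don't form valid utf-8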