
I am confused about the hex representation of Unicode. I have an example file containing a single mathematical integral sign character, U+222B. If I cat the file or edit it in vi I get an integral sign displayed. A hex dump of the file shows its hex content is 88e2 0aab.

In Python I can create an integral Unicode character, and printing p renders an integral sign on my terminal.

>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫

What confuses me is that I can open a file with the integral sign in it and get the integral symbol, but the hex content is different.

>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫

One is a Unicode object and one is a plain string, but what is the relationship between the two hex codes, apparently for the same character? How would I manually convert one to the other?

Keir
    `0x222b` = 8747 is the integer number of the codepoint that is, in Unicode, associated with the integral sign, `∫`. when you write text to a file or send it over the wire, it must always be serialized to bits—commonly, octets (bytes) are the preferred units here. the series `0xe2`, `0x88`, `0xab` (or `0b11100010`, `0b10001000`, `0b10101011` in binary) is the UTF-8 encoding (http://en.wikipedia.org/wiki/UTF-8) of `0x222b`. incidentally, the three leading `1`s in the first byte tell you that this codepoint is encoded in three bytes: UTF-8 is both variable-width and 'synchronizing'. – flow Sep 10 '13 at 22:45
  • Obligatory: http://bit.ly/unipain – Daenyth Sep 10 '13 at 22:59
  • that bitly link does look promising. also one should point out that Unicode handling is much saner in Py3 than it used to be in Py2—to the point where this one factor should weigh heavily when deciding about which Python version to use. sadly, there's an ungood and ongoing split between Py2 and Py3, with 3rd party library support lagging. where Py3 shines is that the old 'ASCII strings' are gone; you always deal with a buffer of bytes (encoded) or else a (Unicode) text (decoded). it's just changed concepts / naming things, but then programming is a lot about concepts and naming things. – flow Sep 11 '13 at 10:54
  • In addition to the changed concepts and names, Py3 also has the safer behavior of not implicitly converting between bytes and strings. Try to concatenate them and it'll complain immediately, which is much better than the Py2 approach of having it usually work but fail messily when the default encoding couldn't convert. – Peter DeGlopper Sep 11 '13 at 23:21
  • I'm still missing something. The byte pairs in the hex dump, 88e2 0aab, are reversed, and one character is a newline, so we are left with 0xe2, 0x88, 0xab – Keir Sep 12 '13 at 01:47
  • This means the first byte starts with two, so I want another pair of twos. The next two bits, 10, signify it's a continuation byte, but the next six bits give an eight, not a two? – Keir Sep 12 '13 at 01:56
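flow's byte-by-byte breakdown above can be checked directly. A minimal Python 3 sketch (variable names are just illustrative) printing the bit pattern of each UTF-8 byte:

```python
# Encode U+222B and show each byte's bit pattern (Python 3).
codepoint = '\u222b'                 # the integral sign
encoded = codepoint.encode('utf-8')

print(encoded.hex())                 # e288ab
for byte in encoded:
    print(format(byte, '08b'))       # 11100010, 10001000, 10101011
```

The three leading 1s in `11100010` announce a three-byte sequence, and both continuation bytes start with `10`, exactly as described.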

2 Answers


The plain string has been encoded using UTF-8, one of a variety of ways to represent Unicode code points in bytes. UTF-8 is a multibyte encoding which has the often useful feature that it is a superset of ASCII - the same byte encodes any ASCII character in UTF-8 or in ASCII.

In Python 2.x, use the encode method on a Unicode object to encode it, and decode or the unicode constructor to decode it:

>>> u'\u222b'.encode('utf8')
'\xe2\x88\xab'
>>> '\xe2\x88\xab'.decode('utf8')
u'\u222b'
>>> unicode('\xe2\x88\xab', 'utf8')
u'\u222b'

print, when given a Unicode argument, implicitly encodes it. On my system:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'

See this answer for a longer discussion of print's behavior: Why does Python print unicode characters when the default encoding is ASCII?

Python 3 handles things a bit differently; the changes are documented here: http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
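For completeness, a minimal sketch of the same round trip in Python 3, where the two types are bytes (encoded) and str (decoded):

```python
# Python 3: str is always Unicode text; bytes is a raw byte buffer.
text = '\u222b'                # str, one character
data = text.encode('utf-8')    # bytes: b'\xe2\x88\xab'

assert data == b'\xe2\x88\xab'
assert data.decode('utf-8') == text

# Unlike Python 2, mixing the two fails immediately instead of
# attempting an implicit ASCII conversion:
try:
    text + data
except TypeError as err:
    print('refused:', err)
```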

Peter DeGlopper
    Must Read: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) by Joel Spolsky. – Jongware Sep 10 '13 at 22:23

Okay, I have it. Thanks for the answers. I wanted to see how the conversion works by hand rather than just convert a string using Python.

The conversion works this way.

Suppose you have a Unicode character, in my example the integral symbol.

A hex dump produces

echo -n "∫"|od -x
0000000 88e2 00ab

od -x prints 16-bit little-endian words, so the bytes within each pair come out swapped; the actual byte sequence is

e2 88 ab (the trailing 00 is just padding because the file holds an odd number of bytes)

The first hex character is E. The high bit means this is a Unicode string, and the leading 1 bits tell you that three bytes are used to represent the character. The first two bits of each remaining byte are thrown away (they just mark continuation bytes). The full bit stream is

111000101000100010101011

Throw away the first 4 bits of the lead byte and the first two bits of each continuation byte

0010001000101011

Re-expressing this in hex

222B

There you have it!
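The bit surgery above can be sketched as Python 3 code (the mask values follow the UTF-8 byte layout; names are illustrative):

```python
# Manually decode the three-byte UTF-8 sequence back to its codepoint.
b = [0xE2, 0x88, 0xAB]

# Keep the low 4 bits of the lead byte (dropping the 1110 marker) and
# the low 6 bits of each continuation byte (dropping the 10 markers).
codepoint = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)

print(hex(codepoint))          # 0x222b
assert chr(codepoint) == '\u222b'
```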

Keir
  • "the high bit means this is a Unicode string" isn't quite right. It blurs the line between the use of characters that weren't in ASCII with the UTF-8 specific encoding details. More precisely, the high bit means it's part of a multi-byte encoding; the number of leading 1s before the first 0 tell you the total number of bytes in the encoding (3, in this case). You have the actual processing correct, but I recommend closely reading the Joel on Software essay Jongware linked to. Unicode and encodings are related concepts, but not as interchangeable as this wording implies. – Peter DeGlopper Sep 12 '13 at 04:09