How do I convert an int representing a UTF-8 character into a Unicode code point?

Question

Let us use the character Latin Capital Letter A with Ogonek (U+0104) as an example.

I have an int that represents its UTF-8 encoded form:

my_int = 0xC484
# Decimal: `50308`
# Binary: `0b1100010010000100`

If I use the unichr function, I get: \uC484 or 쒄 (U+C484)

But, I need it to output: Ą

How do I convert my_int to a Unicode code point?

Interesting question. I'm curious as to what kind of API yields UTF-8 integers, though? — Cameron, Mar 26 '15 at 18:49
related: [Convert a Python int into a big-endian string of bytes](http://stackoverflow.com/q/846038/4279) — jfs, Mar 26 '15 at 22:03

score 3 · Answer 1 · answered Mar 26 '15 at 18:03

To convert the integer 0xC484 to the bytestring '\xc4\x84' (the UTF-8 representation of the Unicode character Ą), you can use struct.pack():

>>> import struct
>>> struct.pack(">H", 0xC484)
'\xc4\x84'

... where > in the format string represents big-endian, and H represents unsigned short int.

Once you have your UTF-8 bytestring, you can decode it to Unicode as usual:

>>> struct.pack(">H", 0xC484).decode("utf8")
u'\u0104'

>>> print struct.pack(">H", 0xC484).decode("utf8")
Ą
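The session above is Python 2. As a rough sketch of the same idea in Python 3 (where unichr() and the print statement are gone), the snippet below packs the integer, decodes it, and then uses ord() to get the numeric code point. The variable names raw, char and code_point are only illustrative.

```python
import struct

my_int = 0xC484

# Pack as a big-endian unsigned 16-bit value: this only fits 2-byte sequences.
raw = struct.pack(">H", my_int)      # b'\xc4\x84'

# Decode the UTF-8 bytes into a one-character string.
char = raw.decode("utf-8")           # 'Ą'

# ord() gives the code point as an integer.
code_point = ord(char)               # 0x104

print(char, hex(code_point))         # Ą 0x104
```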

answered Mar 26 '15 at 18:03

Zero Piraeus


UTF-8 can use a different number of bytes to encode different Unicode code points (from one byte up to four bytes). `'>H'` works only for 2-byte sequences. – jfs Mar 26 '15 at 22:00
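One possible way around the fixed width, sketched for Python 3: work out how many bytes the integer occupies from its bit length, then let int.to_bytes() build the byte string. The helper name utf8_int_to_char is hypothetical and not taken from any of the answers.

```python
def utf8_int_to_char(n):
    """Decode an int holding a big-endian UTF-8 byte sequence (1 to 4 bytes)."""
    # Number of whole bytes needed to hold n; at least one so that 0 still works.
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big").decode("utf-8")


print(utf8_int_to_char(0x41))        # 1-byte sequence
print(utf8_int_to_char(0xC484))      # 2-byte sequence
print(utf8_int_to_char(0xE282AC))    # 3-byte sequence
print(utf8_int_to_char(0xF09F9880))  # 4-byte sequence
```

Integers that are not valid UTF-8 byte sequences raise UnicodeDecodeError, which is probably what you want here.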

score 1 · Answer 2 · edited Mar 30 '17 at 09:53

Encode the number to a hex string, using hex() or %x. Then you can interpret that as a series of hex bytes using the hex decoder. Finally use the utf-8 decoder to get a unicode string:

def weird_utf8_integer_to_unicode(n):
    s = '%x' % n
    if len(s) % 2:
        s = '0' + s
    return s.decode('hex').decode('utf-8')

The len check is in case the first byte is in the range 0x1–0xF, which would leave it missing a leading zero. This should be able to cope with a string of any length and any character (however, encoding a byte sequence in an integer like this cannot preserve leading zero bytes).
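Note that the 'hex' codec on str is Python 2 only. As a sketch of the same approach for Python 3, bytes.fromhex() can stand in for s.decode('hex'); the rest of the function is unchanged.

```python
def weird_utf8_integer_to_unicode(n):
    # Format the integer as hex digits, without the '0x' prefix.
    s = '%x' % n
    # Pad to an even number of digits so every byte has two hex characters.
    if len(s) % 2:
        s = '0' + s
    # Python 3 replacement for s.decode('hex').
    return bytes.fromhex(s).decode('utf-8')


print(weird_utf8_integer_to_unicode(0xC484))  # Ą
```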

score 1 · Answer 3 · edited May 23 '17 at 12:02

>>> int2bytes(0xC484).decode('utf-8')
u'\u0104'
>>> print(_)
Ą

where int2bytes() is defined here.
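The linked definition of int2bytes() is not reproduced on this page. As a stand-in, here is a minimal sketch that assumes it turns a non-negative integer into its big-endian byte string; this is a guess at its behavior, not the linked implementation.

```python
def int2bytes(n):
    """Hypothetical stand-in: big-endian bytes of a non-negative int."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    # Peel off the lowest byte each round; bytes come out in reverse order.
    while n:
        n, low = divmod(n, 256)
        out.append(low)
    out.reverse()
    # Represent zero as a single NUL byte rather than an empty string.
    return bytes(out) or b"\x00"


print(int2bytes(0xC484).decode('utf-8'))  # Ą
```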

answered Mar 26 '15 at 21:57

jfs

