Convert unicode codepoint to UTF8 hex in python

Question

I want to convert a number of unicode codepoints read from a file to their UTF8 encoding.

For example, I want to convert the string 'FD9B' to the string 'EFB69B'.

I can do this manually using string literals like this:

u'\uFD9B'.encode('utf-8')

but I cannot work out how to do it programmatically.

score 23 · Accepted Answer · edited Apr 04 '20 at 17:39

Use the built-in function chr() to convert the number to a character, then encode that:

>>> chr(int('fd9b', 16)).encode('utf-8')
b'\xef\xb6\x9b'

This is the encoded byte string itself. If you want it as ASCII hex, convert each byte to hex, for example with the bytes.hex() method in Python 3.5+ or with hex(ord(c)) for each character c in Python 2.

Note: If you are still stuck with Python 2, you can use unichr() instead.
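
For the hex-string form the question asks for, a minimal Python 3 sketch could look like the following. The helper name codepoint_to_utf8_hex is made up for illustration and is not part of the original answer; bytes.hex() assumes Python 3.5 or later.

```python
def codepoint_to_utf8_hex(codepoint_hex):
    """Return the UTF-8 encoding of one code point as uppercase hex.

    `codepoint_hex` is a hex string such as 'FD9B'.
    Hypothetical helper, named here only for illustration.
    """
    char = chr(int(codepoint_hex, 16))   # code point -> one-character str
    utf8_bytes = char.encode('utf-8')    # str -> UTF-8 bytes
    return utf8_bytes.hex().upper()      # bytes -> uppercase hex string


print(codepoint_to_utf8_hex('FD9B'))  # EFB69B
```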

answered May 15 '09 at 10:18 by unwind · edited Apr 04 '20 at 17:39 by Peque

The output is not as specified by the question. Anyway, if the OP is happy… – tzot May 15 '09 at 19:55
FYI for Py3K it's `chr(int('fd9b', 16)).encode('utf-8')`. – Matthieu Riegler Jul 15 '14 at 12:11
@tzot: `''.join('{:02X}'.format(n) for n in chr(int('FD9B', 16)).encode())` gives the string `'EFB69B'` in Python 3. – CodeManX Apr 07 '16 at 20:36
I edited your answer to use the Python 3 solution and added a note in case someone is still stuck with Python 2. I hope you don't mind... `chr(int('1f607', 16))` – Peque Apr 04 '20 at 17:43

score 4 · Answer 2 · answered Mar 03 '13 at 02:22

Here's a complete solution (Python 2):

>>> ''.join(['{0:02x}'.format(ord(x)) for x in unichr(int('FD9B', 16)).encode('utf-8')]).upper()
'EFB69B'
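
One detail worth checking when formatting bytes one at a time is zero padding: a byte below 0x10 needs a two-digit format, otherwise the hex string loses a digit. The following is a Python 3 sketch of the same idea, with a hypothetical helper name (utf8_hex) chosen only for illustration:

```python
def utf8_hex(codepoint_hex):
    """Uppercase, zero-padded UTF-8 hex for one code point (hypothetical helper)."""
    encoded = chr(int(codepoint_hex, 16)).encode('utf-8')
    # Iterating over bytes in Python 3 yields ints, so no ord() is needed.
    return ''.join('{:02X}'.format(byte) for byte in encoded)


print(utf8_hex('FD9B'))  # EFB69B
print(utf8_hex('0A'))    # 0A (an unpadded '{:X}' would give just 'A')
```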

answered Mar 03 '13 at 02:22 by simon

score 3 · Answer 3 · answered May 15 '09 at 15:05

data_from_file='\uFD9B'
unicode(data_from_file,"unicode_escape").encode("utf8")
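
The snippet above is Python 2 (unicode() no longer exists in Python 3). A rough Python 3 equivalent, assuming the file contains the literal text \uFD9B and only ASCII characters, might look like this:

```python
data_from_file = r'\uFD9B'  # literal backslash-u text, as it might appear in a file

# str has no decode() in Python 3, so go through bytes first.
# Assumption: the escaped text is pure ASCII.
char = data_from_file.encode('ascii').decode('unicode_escape')
utf8_bytes = char.encode('utf-8')

print(utf8_bytes)  # b'\xef\xb6\x9b'
```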

answered May 15 '09 at 15:05 by pixelbeat

score 2 · Answer 4 · answered May 15 '09 at 10:20

Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\uFD9B'.encode('utf-8')
'\xef\xb6\x9b'
>>> s = 'FD9B'
>>> i = int(s, 16)
>>> i
64923
>>> unichr(i)
u'\ufd9b'
>>> _.encode('utf-8')
'\xef\xb6\x9b'

score 1 · Answer 5 · answered May 15 '09 at 19:54

If the input string length is a multiple of 4 (i.e. your unicode code points are UCS-2 encoded), then try this:

import struct

def unihex2utf8hex(arg):
    count= len(arg)//4
    uniarr= struct.unpack('!%dH' % count, arg.decode('hex'))
    return u''.join(map(unichr, uniarr)).encode('utf-8').encode('hex')

>>> unihex2utf8hex('fd9b')
'efb69b'
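
The function above relies on Python 2's str.decode('hex') and unichr(). A Python 3 sketch of the same idea follows; note that it swaps struct.unpack for the utf-16-be codec, which is a substitution and not the original author's method. One side effect is that surrogate pairs in the input are combined into a single code point.

```python
def unihex2utf8hex(arg):
    """Convert concatenated 4-digit hex code units to lowercase UTF-8 hex."""
    raw = bytes.fromhex(arg)            # 'fd9b' -> two raw bytes
    text = raw.decode('utf-16-be')      # big-endian 16-bit units -> str
    return text.encode('utf-8').hex()   # str -> UTF-8 bytes -> hex string


print(unihex2utf8hex('fd9b'))      # efb69b
print(unihex2utf8hex('fd9b0041'))  # efb69b41
```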

Jaymon · Answer 6 · 2017-12-08T23:01:51.510

You might encounter an error when using unichr() with code points above U+FFFF on a narrow Python 2 build:

>>> n = int('0001f600', 16)
>>> unichr(n)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Here is another approach for wide unicode on narrow python builds:

>>> n = int('0001f600', 16)
>>> s = '\\U{:0>8X}'.format(n)
>>> s = s.decode('unicode-escape')
>>> s.encode("utf-8")
'\xf0\x9f\x98\x80'

And using the original question's value:

>>> n = int('FD9B', 16)
>>> s = '\\u{:0>4X}'.format(n)
>>> s = s.decode('unicode-escape')
>>> s.encode("utf-8")
'\xef\xb6\x9b'
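
For comparison, Python 3.3 and later have no narrow/wide build distinction, so this workaround should not be needed there. A short sketch, assuming Python 3:

```python
n = int('0001f600', 16)
utf8_bytes = chr(n).encode('utf-8')  # chr() accepts 0 through 0x10FFFF

print(utf8_bytes)        # b'\xf0\x9f\x98\x80'
print(utf8_bytes.hex())  # f09f9880
```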
