I'm trying to write some Chinese, Russian, or other non-English character sets out to a flat file for testing purposes. I'm stuck on how to turn a Unicode code point, given as a hexadecimal or decimal value, into its corresponding character.

For example, in Python, if you had a hard-coded set of characters like абвгдежзийкл, you would assign value = u"абвгдежзийкл" and have no problem.

If, however, you had a single code point such as 1081 (decimal) / 0439 (hexadecimal) stored in a variable and wanted to print its corresponding actual character (and not just output 0x439), how would this be done? The Unicode decimal/hex value above refers to й.

Jeremy
stoneferry
  • You might want to revise the title of your question. It mentions UTF-8, yet the question has nothing to do with UTF-8. – NPE May 23 '12 at 08:12
  • Your constant mention of "decimal or hex" makes me think that you are unaware that "decimal or hex" is just a matter of representation and not a property of the value itself. – glglgl May 23 '12 at 08:18

4 Answers

Python 2: Use unichr():

>>> print(unichr(1081))
й

Python 3: Use chr():

>>> print(chr(1081))
й
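
Since the question is about writing such characters to a flat file, here is a minimal Python 3 sketch of that step. The helper name `write_code_points`, the file name, and the choice of UTF-8 are assumptions for illustration and do not come from the answer above.

```python
import os
import tempfile


def write_code_points(path, code_points):
    """Write one character per integer code point to a UTF-8 text file."""
    text = "".join(chr(cp) for cp in code_points)
    # Name the encoding explicitly; the platform default may not be UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")
    return text


# Hypothetical demo: the code point lives in a variable, not a literal.
code = 1081
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
written = write_code_points(demo_path, [code, 0x897F])
```

Whether the value was originally typed as `1081` or `0x439` makes no difference here, because both are the same integer by the time `chr()` sees it.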
Brad Solomon
NPE
  • Thanks for the answer, although it's not what I'm looking for, since I already know how to handle hard-coded entries. I want to know how to handle decimal or hexadecimal Unicode values in a variable for standard output, or output to a file. – stoneferry May 23 '12 at 08:07
  • @stoneferry: Just change the `1081` to the name of your variable that contains the character code. – NPE May 23 '12 at 08:08
  • If I had a variable that contained only the integer '1081', how would I, for example, use the print command to output the character and not just '1081'? – stoneferry May 23 '12 at 08:09
  • If your variable is a string with the hexadecimal form of a number, you can convert it to an int with `int(var, 16)`. For example, `int('0x0439', 16)` gives `1081`. – Nathan Craike Oct 09 '12 at 04:32
  • Note: [`unichr`](https://docs.python.org/2/library/functions.html#unichr) is only for Python 2. In Python 3, you can simply use [`chr`](https://docs.python.org/3/library/functions.html#chr). – Martin Thoma Feb 25 '15 at 15:11

So the answer to the question is:

  1. convert the hexadecimal string to an integer with int(hex_value, 16)
  2. then get the corresponding string with chr().

To sum up:

>>> print(chr(int('0x897F', 16)))
西
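
The two steps above can be wrapped in a small helper. This is only a sketch: the name `char_from_hex` and the handling of a `U+` prefix are assumptions added for illustration, not part of the answer.

```python
def char_from_hex(hex_value):
    """Return the character for a hexadecimal code point given as a string.

    Accepts forms such as '897F', '0x897F' and 'U+897F'.
    """
    cleaned = hex_value.strip()
    # int(..., 16) already understands an optional '0x' prefix,
    # but not the 'U+' notation, so drop that one by hand.
    if cleaned.upper().startswith("U+"):
        cleaned = cleaned[2:]
    return chr(int(cleaned, 16))


print(char_from_hex("0x897F"))
print(char_from_hex("U+0439"))
```

If the value is already an integer rather than a string, the `int()` step is unnecessary and `chr(value)` is enough.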
Édouard Lopez
  • `0x` is optional if you explicitly specify the base, i.e. `chr(int('897F', 16))` will work too – ccpizza Jul 18 '20 at 20:20
  • why? `chr(0x897F)` works fine. `0x897F` is an integer literal https://docs.python.org/3/reference/lexical_analysis.html#integer-literals – FlipMcF Nov 24 '21 at 18:23

While working on a project that included parsing some JSON, I encountered a similar problem. I had a lot of strings in which all non-ASCII characters were escaped like this:

>>> print(content)
\u0412\u044B j\u0435\u0441\u0442\u0435 \u0438\u0437 \u0420\u043E\u0441\u0441\u0438\u0438?
...
>>> print(content)
\u010Cemu jesi na\u010Dinal izu\u010Dati med\u017Euslovjansky jezyk?

Converting such mixes symbol-by-symbol with unichr() would be tedious. The solution I eventually decided on:

content.encode("utf8").decode("unicode-escape")

The first operation (encoding) produces byte strings like this:

b'\\u0412\\u044B j\\u0435\\u0441\\u0442\\u0435 \\u0438\\u0437 \\u0420\\u043E\\u0441\\u0441\\u0438\\u0438?'
b'\\u010Cemu jesi na\\u010Dinal izu\\u010Dati med\\u017Euslovjansky jezyk?'

and the second operation (decoding) turns the byte string back into a Unicode string, interpreting each \uXXXX escape sequence along the way (the doubled \\ shown above is just how the repr displays a single \). This "unpacks" the characters, giving a result like this:

Вы jесте из России?
Čemu jesi načinal izučati medžuslovjansky jezyk?
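
A self-contained Python 3 sketch of the same round trip follows. Two details are assumptions that differ from the answer: it encodes with `"ascii"` instead of `"utf8"`, so that input which already contains real non-ASCII characters fails loudly instead of being silently mangled (`unicode-escape` treats non-escaped bytes as Latin-1), and it shows `json.loads` as an alternative, which only works when the text is valid inside a JSON string literal.

```python
import json

# Raw string: the backslashes are literal characters, as in the escaped input.
escaped = r"\u0412\u044B j\u0435\u0441\u0442\u0435 \u0438\u0437 \u0420\u043E\u0441\u0441\u0438\u0438?"

# Approach from the answer, with a stricter first step.
decoded = escaped.encode("ascii").decode("unicode-escape")

# Alternative: let the JSON parser interpret the \uXXXX escapes.
# Assumes the text contains no unescaped double quotes or control characters.
decoded_via_json = json.loads('"' + escaped + '"')

print(decoded)
print(decoded == decoded_via_json)
```

If the escaped strings come from a JSON document in the first place, parsing the whole document with `json.loads` normally yields already-unescaped strings, so the manual step may not be needed at all.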

If you run into the error:

ValueError: unichr() arg not in range(0x10000) (narrow Python build)

while trying to convert your hex value using unichr, you can get around it by doing something like:

>>> n = int('0001f600', 16)
>>> s = '\\U{:0>8X}'.format(n)
>>> s
'\\U0001F600'
>>> binary = s.decode('unicode-escape')
>>> print(binary)
😀
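
The session above is Python 2 (`str.decode` does not exist in Python 3). For comparison, here is a sketch of the same conversion on Python 3.3 or later, where `chr()` accepts the full range up to 0x10FFFF; the variable names are illustrative only.

```python
n = int("0001f600", 16)

# Direct route: no narrow-build limitation on Python 3.3+.
face = chr(n)

# The escape-based workaround still works if it is ever needed,
# but it has to go through bytes first.
escape_text = "\\U{:0>8X}".format(n)
face_via_escape = escape_text.encode("ascii").decode("unicode-escape")

print(face)
print(face == face_via_escape)
```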

Jaymon