92

In the Python API, is there a way to get the Unicode code point of a single character?

Edit: In case it matters, I'm using Python 2.7.

Ken
  • e.g. for '\u304f' I want '304f'. is that what 'ord()' will do? Yes- http://docs.python.org/library/functions.html#ord – Ken Sep 03 '11 at 04:45
  • Yes, `ord("\N{HIRAGANA LETTER KU}")` is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of `chr` and `ord` as inverse functions of each other. It’s really easy. – tchrist Sep 03 '11 at 04:48
  • @tchrist it might be worth noting `chr` is the opposite of `ord` in python 3.x, but in python 2.x `unichr` is the inverse of `ord` as `chr` only works for ordinals up to 255 in python 2.x. – cryo Sep 03 '11 at 05:08
  • @David: Yes, but I consider that a legacy system, which doesn't really work very well for Unicode — as you have yourself just demonstrated. `chr` and `ord` were always meant to be inverses, and it was a legacy Python 2 bug that they sometimes weren't. That's nuts. – tchrist Sep 03 '11 at 05:09
  • @tchrist there are still lots of people using python 2.x. Even in python 3.x there are still narrow Unicode builds (for example most Windows builds of python 3.x are narrow.) I wouldn't call most 2.x Unicode issues bugs so much as additions to maintain backwards compatibility with older scripts, python 2.x usually works fine with Unicode. python 3.0 does make things much more consistent though, eliminating the difference between `str` and `unicode`. – cryo Sep 03 '11 at 05:27
  • If `c` is my character variable (say it's equal to `あ`), if I do `ucp = ord(c)` then `print ucp` I get three integers, not a single integer. How do I get a single integer? – Ken Sep 03 '11 at 05:30
  • In case it matters I'm using Python 2.7. – Ken Sep 04 '11 at 10:03
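
A note on the "three integers" symptom in the comments above: one plausible cause, assuming `c` is a Python 2 byte string holding UTF-8 rather than a `unicode` object, is that あ is stored as three bytes. A minimal sketch of decoding first and then calling `ord()` (the names `raw` and `text` are illustrative, not from the question):

```python
# Assumption: the character arrived as UTF-8 bytes, which is what a plain
# '...' literal holds in Python 2 when the source file is UTF-8.
raw = b'\xe3\x81\x82'        # the three UTF-8 bytes of U+3042 (hiragana A)
text = raw.decode('utf-8')   # decode to a text string first
code_point = ord(text)       # now a single integer

print('%d bytes, %d character, code point %x' % (len(raw), len(text), code_point))
```

If the bytes are in some other encoding, the `decode()` argument has to match it; UTF-8 is only an assumption here.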

5 Answers

102

If I understand your question correctly, you can do this.

>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'

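This shows the Unicode escape sequence as it would be written in source code. The `b'...'` result is what Python 3 prints; on Python 2, start from a `unicode` literal such as `u'㈲'`.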

Keith
  • In case it matters, I'm using Python 2.7. – Ken Sep 04 '11 at 09:56
  • What does the `b` mean? – MK Yung Dec 18 '13 at 05:42
  • @MKYung That prefix means it's a byte string literal. – Keith Dec 18 '13 at 07:34
  • For me, this doesn't work with ASCII characters: `'a'.encode('unicode_escape')` gives `a` rather than a `\u` escape. (Same with `u'a'.encode('unicode_escape')`.) Also, the format is different when you go outside the Basic Multilingual Plane: `u'😱'.encode('unicode_escape')` gives `'\\U0001f631'`. – ShreevatsaR Dec 29 '13 at 09:51
  • @ShreevatsaR Try `"a".encode("unicode_escape").hex()` to get the hexadecimal representation as a `str`. Alternatively, `hex(ord("a"))` will also work. – imrek May 15 '17 at 13:56
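
For the `'304f'` form the question asks for, formatting the result of `ord()` avoids the inconsistencies of `unicode_escape` noted in the comments. A minimal sketch, assuming Python 3.3+ or a wide Python 2 build (on a narrow build the last call would fail, because the character is two code units long); `code_point_hex` is a name made up for this example:

```python
def code_point_hex(ch):
    """Return the code point of one character as lowercase hex, padded to 4 digits."""
    return format(ord(ch), '04x')

print(code_point_hex(u'\u304f'))      # 304f
print(code_point_hex(u'a'))           # 0061
print(code_point_hex(u'\U0001f631'))  # 1f631
```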
73
>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
... 
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233
Mike Graham
  • Of course, it might print out `u'e' 101` and `u'\u0301' 769` at the end instead... – Dietrich Epp Sep 03 '11 at 04:35
  • It looks like 'ord()' does what I want: http://docs.python.org/library/functions.html#ord. Thanks. – Ken Sep 03 '11 at 04:46
  • If 'c' is my character variable (say it's equal to 'あ'), if I do `ucp = ord(c)` then `print ucp` I get three integers, not a single integer. How do I get a single integer? – Ken Sep 03 '11 at 05:26
  • How did you get あ into the variable? If it's a literal in your source code, then make sure your source file has an appropriate encoding set. Otherwise, ask a new question and post more detailed code. – Karl Knechtel Sep 03 '11 at 06:30
  • In case it matters, I'm using Python 2.7. – Ken Sep 04 '11 at 09:56
  • An important thing to mention is that it doesn't work in older versions of IPython (for example 0.10.2, which is in Debian Squeeze). In plain Python (for example 2.6.*) it works OK. – Michel Samia Aug 22 '12 at 10:30
  • I tried this same example with "བཞིན" but it does not work. Do you have an idea how I can yield the same result as with "café" in double-byte character sets? i.e. my case is same as OP's comment above. You can validate by using the code example from Mike Graham above, but use the characters I've provided. – mikkokotila Jan 02 '18 at 16:29
  • @mikkokotila You don't mention your platform or Python version. Unfortunately, the details do vary. On Python 2, if you use `u"བཞིན"` (not `"བཞིན"`), you don't run into problems from the fact that the characters are bigger than one byte. It will, however, treat this as four characters, with the ི and the ཞ considered two different things. I don't know if Unicode includes such combinations for Tibetan like it does for accented Latin (where both one-codepoint é (`u'\xe9'`) and two-codepoint é (`u'e\u0301'`) exist). Sorry I can't be more helpful. – Mike Graham Jan 06 '18 at 16:42
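
To make the one-codepoint versus two-codepoint é point from the comments concrete, here is a small sketch using the standard `unicodedata` module. The variable names are illustrative; whether normalizing is appropriate depends on the application.

```python
import unicodedata

composed = u'caf\xe9'        # é as a single code point, U+00E9
decomposed = u'cafe\u0301'   # e followed by U+0301 COMBINING ACUTE ACCENT

print([hex(ord(c)) for c in composed])
print([hex(ord(c)) for c in decomposed])

# NFC composes e + U+0301 into U+00E9; NFD does the reverse.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(nfc == composed, nfd == decomposed)
```

Normalizing to NFC before iterating gives one code point per accented Latin letter where a precomposed form exists. Scripts without precomposed forms still yield separate code points for base letters and combining marks.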
15

Turns out getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.

Up until Python 3.3, it was possible to compile Python in one of two modes:

  1. sys.maxunicode == 0x10FFFF

In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:

>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']

This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.

  2. sys.maxunicode == 0xFFFF

In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']

This is the default for Python 2.7 on macOS and Windows.
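
As a rough illustration of where `u'\ud83d'` and `u'\udc0d'` come from, the UTF-16 surrogate arithmetic can be written out by hand. This is only a sketch; `to_surrogate_pair` is a name made up for this example, not a standard library function.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F40D)
print('%x %x' % (high, low))  # d83d dc0d
```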

This runtime difference makes writing Python modules to manipulate Unicode strings as series of codepoints quite inconvenient.

The codepoints module

To solve this, I contributed a new module codepoints to PyPI:

https://pypi.python.org/pypi/codepoints/1.0

This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:

>>> import sys
>>> import codepoints
>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'
Ben Hamilton
  • Hello I'm trying to use codepoints with https://en.wikipedia.org/wiki/Regional_Indicator_Symbol offsets to make flags of various countries in Python. Here is a javascript implementation: https://github.com/thekelvinliu/country-code-emoji/blob/9d6d20f99f66ef88e01b72f62367e2a950bf1936/src/index.js How do I use `codepoints.to_unicode(x)` on a modified codes that has been offset by the appropriate letters of the basic flag? – thadk Mar 06 '17 at 02:06
  • UPDATE: figured it out, to_unicode needs at least a two-tuple. – thadk Mar 06 '17 at 02:23
  • @thadk , glad you figured it out—but could you share with me the first code snippet you tried? I'm curious what didn't work. – Ben Hamilton Mar 07 '17 at 16:55
  • ```
    import codepoints
    # does not work
    # print(codepoints.to_unicode(tuple(127462)))
    # works
    print(codepoints.to_unicode((127462,)))
    # works ("AU" Australia Flag)
    print(codepoints.to_unicode((127462,127482)))
    ```
    – thadk Mar 08 '17 at 03:07
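
For the flag use case in the comments, the code points themselves can be computed with plain arithmetic on the Regional Indicator Symbol block, which starts at U+1F1E6 for the letter A. A minimal sketch, assuming Python 3 and two-letter ASCII country codes; `REGIONAL_INDICATOR_A` and `flag_code_points` are names made up for this example and are not part of the codepoints module.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag_code_points(country_code):
    """Map ASCII letters such as 'AU' to regional indicator code points."""
    return tuple(REGIONAL_INDICATOR_A + ord(letter) - ord('A')
                 for letter in country_code.upper())

points = flag_code_points('AU')
print(points)  # (127462, 127482)

# On Python 3, chr() accepts any code point, so the flag string can be built directly.
flag = ''.join(chr(cp) for cp in points)
```

The resulting tuple has the same values as the one passed to `codepoints.to_unicode` in the snippet above.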
12

Usually, you just call ord(character) to find the code point of a character. For completeness, though: characters outside the Basic Multilingual Plane (U+10000 and above) are represented as surrogate pairs (i.e. two code units) in narrow Python builds, so in that case I often needed this small workaround:

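def get_wide_ordinal(char):
    """Return the code point of one character, even on a narrow build."""
    if len(char) != 2:
        # One code unit: ord() already returns the code point.
        return ord(char)
    # Two code units: combine the UTF-16 high and low surrogates.
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)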

This is rare in most applications though, so normally just use ord().
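
The same arithmetic can be extended from a single character to a whole string. The following is only a sketch of that idea (`iter_code_points` is a hypothetical helper, not part of any library): it walks the string's elements and joins a high surrogate with the low surrogate that follows it, so it gives the same result on narrow and wide builds.

```python
def iter_code_points(text):
    """Yield code points from a string, combining UTF-16 surrogate pairs."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            following = ord(text[i + 1])
            if 0xDC00 <= following <= 0xDFFF:
                # High surrogate followed by low surrogate: one code point.
                yield 0x10000 + (unit - 0xD800) * 0x400 + (following - 0xDC00)
                i += 2
                continue
        # Ordinary element (or an unpaired surrogate): pass it through.
        yield unit
        i += 1

print([hex(cp) for cp in iter_code_points(u'a\ud83d\udc0d')])  # ['0x61', '0x1f40d']
```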

Samie Bencherif
cryo
  • A surrogate pair is NOT "two characters". It represents ONE character. It consists of two code points. See "code point" and "code point type" in http://unicode.org/glossary/ – John Machin Sep 03 '11 at 11:05
  • @JohnMachin: You're close, but not quite: A surrogate pair is still just one code point. It's two code units. – Thanatos Feb 06 '13 at 22:12
  • @Thanatos: Have you actually read the link that I provided? Have you followed through to `D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.` and the low equivalent D73? – John Machin Feb 07 '13 at 10:03
  • @JohnMachin: It is slightly confusing that the standard uses that terminology. I suppose in some ways, they are code points — code points in those ranges are reserved for surrogate pairs. I think the standard is getting that the code points are reserved, that is all. Note, "The high-surrogate and low-surrogate code points are designated for *surrogate code units* in the UTF-16 character encoding form. They are unassigned to any abstract character." – Thanatos Feb 07 '13 at 21:55
  • My point was that a surrogate pair, once decoded, represent a single code point. There's only two things: the encoded UTF-16 stream of code units, or the decoded code point stream; for surrogate pairs, you'll have 2 in the former and 1 in the latter. – Thanatos Feb 07 '13 at 21:55
4

In Python 2:

>>> print hex(ord(u'人'))
0x4eba