Python unicode indexing shows different character

Question

I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different string:

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why is this happening, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.

If that's a real MCVE, you have a very strange Python 2.7 build. Please edit into your question the values of `unicodedata.unidata_version` and `sys.maxunicode`? — wim, Mar 20 '19 at 18:00
https://stackoverflow.com/questions/35404144/correctly-extract-emojis-from-a-unicode-string — Josh Lee, Mar 20 '19 at 19:13
I *can* in fact repro on MacOS Mojave using the preinstalled `/usr/bin/python`. Perhaps your question should mention your platform (though it's visible from the screenshot if you know where to look). — tripleee, Mar 20 '19 at 19:35

tripleee · Answer 1 · 2019-03-21T19:04:54.863

3

Looks like your Python 2 build uses surrogates for representing code points outside of the Basic Multilingual Plane. See e.g. "How to work with surrogate pairs in Python?" for a bit of background.
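A quick way to confirm this is to check `sys.maxunicode`, as requested in the comments on the question. The snippet below is a minimal sketch; the helper name `is_narrow_build` is made up for illustration and is not part of any library.

```python
import sys

def is_narrow_build():
    # Hypothetical helper: a narrow build reports 0xFFFF (65535) here,
    # a wide build or Python 3.3+ reports 0x10FFFF.
    return sys.maxunicode == 0xFFFF

print(is_narrow_build())
```

On the build described in the question (`sys.maxunicode` is 65535) this should report a narrow build.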

My recommendation would be to switch to Python 3 for anything involving string handling as soon as possible.

edited Mar 21 '19 at 19:04

answered Mar 20 '19 at 18:47

tripleee


score 2 · Accepted Answer · answered Mar 20 '19 at 20:45


A Python 2 "narrow" build uses UTF-16 to store Unicode strings (a so-called leaky abstraction), so code points above U+FFFF are stored as two UTF-16 surrogates. To retrieve the code point, you have to get both the leading and trailing surrogate:

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]     # Just the trailing surrogate
u'\udc4d'
>>> s[-2:]    # leading and trailing
u'\U0001f44d'
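If the last character is not always outside the BMP, slicing `[-2:]` unconditionally would grab an extra character. The sketch below checks for a trailing surrogate pair first; `last_code_point` and `combine_surrogates` are hypothetical helper names, and the arithmetic is the standard UTF-16 decoding formula. It should behave the same on a narrow Python 2 build and on Python 3.

```python
def last_code_point(s):
    """Return the last full code point of s as a string (one or two code units)."""
    if (len(s) >= 2
            and u'\ud800' <= s[-2] <= u'\udbff'    # leading (high) surrogate
            and u'\udc00' <= s[-1] <= u'\udfff'):  # trailing (low) surrogate
        return s[-2:]
    return s[-1:]


def combine_surrogates(lead, trail):
    """Compute the integer code point encoded by a UTF-16 surrogate pair."""
    return 0x10000 + ((ord(lead) - 0xD800) << 10) + (ord(trail) - 0xDC00)


s = u'Python is fun \U0001f44d'
lookup = {u'\U0001f44d': 'thumbs up'}   # example dictionary, made up for illustration
print(lookup[last_code_point(s)])
print(hex(combine_surrogates(u'\ud83d', u'\udc4d')))   # 0x1f44d
```

The dictionary lookup works because, on a narrow build, the key literal `u'\U0001f44d'` is itself stored as the same two surrogates that the slice returns.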

Switch to Python 3.3+ where the problem has been solved and storage details of Unicode code points in a Unicode string are not exposed:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]   # code points are stored in Unicode strings.
'\U0001f44d'
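For comparison, here is a small Python 3 sketch showing that the surrogates only appear when you explicitly encode to UTF-16; the string itself is indexed by code point. The variable names are illustrative only.

```python
import struct

s = 'Python is fun \U0001f44d'
print(len(s))                    # 15: the emoji counts as one code point
print(s[-1] == '\U0001f44d')     # True

# Encoding to UTF-16 (big-endian, no BOM) exposes the surrogate pair.
units = struct.unpack('>2H', s[-1].encode('utf-16-be'))
print([hex(u) for u in units])   # ['0xd83d', '0xdc4d']
```

The second code unit, `0xdc4d`, is the same trailing surrogate that `s[-1]` returned on the narrow Python 2 build.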

answered Mar 20 '19 at 20:45

Mark Tolonen



The narrow build uses UCS-2, which is a little different to UTF-16. See https://en.wikipedia.org/wiki/UTF-16 for details. – PM 2Ring Mar 21 '19 at 11:16

@PM2Ring "UCS-2 disallows use of [surrogates], but UTF-16 allows their use in pairs". – Josh Lee Mar 21 '19 at 11:22
@JoshLee It's complicated. ;) Despite the support of surrogate pairs, the [Python 2.7 docs](http://docs.python.org/3.1/library/sys.html#sys.maxunicode) refer to UCS-2. Also see https://stackoverflow.com/q/53140775/4014959 – PM 2Ring Mar 21 '19 at 11:42
Java and JavaScript have very similar behavior but claim UTF-16 and not UCS-2 – Josh Lee Mar 21 '19 at 12:24
The docs can say anything, but the behavior is UTF16. Microsoft calls UTF-16 “Unicode”. – Mark Tolonen Mar 21 '19 at 13:41
