
I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different string:

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why is this happening, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.

Screenshot of issue

wim
Tim
  • If that's a real MCVE, you have a very strange Python 2.7 build. Please edit into your question the values of `unicodedata.unidata_version` and `sys.maxunicode`. – wim Mar 20 '19 at 18:00
  • @wim Added those edits. It is, in fact, a real MCVE. – Tim Mar 20 '19 at 18:43
  • Can't repro; https://ideone.com/y7jalr – tripleee Mar 20 '19 at 18:51
  • I take it `len(u'\U0001f44d')` returns `2` on your Python? – wim Mar 20 '19 at 18:55
  • @wim Yes, `len(u'\U0001f44d')` returns 2. – Tim Mar 20 '19 at 18:56
  • https://stackoverflow.com/questions/35404144/correctly-extract-emojis-from-a-unicode-string – Josh Lee Mar 20 '19 at 19:13
  • I *can* in fact repro on MacOS Mojave using the preinstalled `/usr/bin/python`. Perhaps your question should mention your platform (though it's visible from the screenshot if you know where to look). – tripleee Mar 20 '19 at 19:35

2 Answers


Looks like your Python 2 build uses surrogates for representing code points outside of the Basic Multilingual Plane. See e.g. "How to work with surrogate pairs in Python?" for a bit of background.
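For a bit of that background in code: the sketch below shows the standard UTF-16 arithmetic that maps a surrogate pair back to a single code point. The function name `combine_surrogates` is hypothetical (it is not part of Python), and the leading surrogate `0xD83D` is the one that pairs with the trailing `0xDC4D` seen in the question.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (given as integers) into one code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the code point's offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDC4D)
print(hex(code_point))  # prints 0x1f44d
```

On a narrow build, `ord(s[-2])` and `ord(s[-1])` would supply the two integers.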

My recommendation would be to switch to Python 3 for anything involving string handling as soon as possible.

tripleee

A Python 2 "narrow" build uses UTF-16 to store Unicode strings (a so-called leaky abstraction), so code points above U+FFFF are stored as two UTF-16 surrogates. To retrieve the code point, you have to get both the leading and trailing surrogate:

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]     # Just the trailing surrogate
u'\udc4d'
>>> s[-2:]    # leading and trailing
u'\U0001f44d'
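
If the string might end in either a BMP character or a surrogate pair, slicing with a fixed `[-2:]` is not enough. Here is a minimal sketch of a helper that checks for a trailing pair first. The names `last_code_point` and `lookup` are hypothetical, and the dictionary stands in for the lookup described in the question. It is written to behave the same on a narrow Python 2 build and on Python 3.

```python
def last_code_point(text):
    """Return the last code point of `text` as a string.

    On a narrow build this is two code units when the string ends in a
    surrogate pair; otherwise it is a single character.
    """
    if (len(text) >= 2
            and u'\ud800' <= text[-2] <= u'\udbff'    # leading surrogate
            and u'\udc00' <= text[-1] <= u'\udfff'):  # trailing surrogate
        return text[-2:]
    return text[-1:]


# Hypothetical dictionary keyed by the full character.
lookup = {u'\U0001f44d': 'thumbs up'}

s = u'Python is fun \U0001f44d'
print(lookup[last_code_point(s)])  # prints thumbs up
```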

Switch to Python 3.3+, where the problem has been solved and the storage details of Unicode code points in a Unicode string are not exposed:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]   # code points are stored in Unicode strings.
'\U0001f44d'
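
As a rough illustration, assuming Python 3.3 or later: the surrogates only reappear when the character is explicitly encoded as UTF-16, not when the string is indexed. The variable names here are illustrative only.

```python
import struct

s = u'Python is fun \U0001f44d'
last = s[-1]  # one code point on Python 3.3+

# Encoding to big-endian UTF-16 yields two 16-bit code units (the surrogates).
high, low = struct.unpack('>2H', last.encode('utf-16-be'))
print(len(last), hex(high), hex(low))  # prints 1 0xd83d 0xdc4d
```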
Mark Tolonen
  • The narrow build uses UCS-2, which is a little different to UTF-16. See https://en.wikipedia.org/wiki/UTF-16 for details. – PM 2Ring Mar 21 '19 at 11:16
  • @PM2Ring "UCS-2 disallows use of [surrogates], but UTF-16 allows their use in pairs". – Josh Lee Mar 21 '19 at 11:22
  • @JoshLee It's complicated. ;) Despite the support of surrogate pairs, the [Python 2.7 docs](http://docs.python.org/3.1/library/sys.html#sys.maxunicode) refer to UCS-2. Also see https://stackoverflow.com/q/53140775/4014959 – PM 2Ring Mar 21 '19 at 11:42
  • Java and JavaScript have very similar behavior but claim UTF-16 and not UCS-2 – Josh Lee Mar 21 '19 at 12:24
  • The docs can say anything, but the behavior is UTF16. Microsoft calls UTF-16 “Unicode”. – Mark Tolonen Mar 21 '19 at 13:41