Below is a simple test. repr seems to work fine, yet len and [x for x in ...] don't seem to split the Unicode text correctly in Python 2.6 and 2.7:

In [1]: u"\U0002f920\U0002f921"
Out[1]: u'\U0002f920\U0002f921'

In [2]: [x for x in u"\U0002f920\U0002f921"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']

The good news is that Python 3.3 does the right thing™.

Is there any hope for the Python 2.x series?

Dima Tisnek

1 Answer

Yes, provided you compiled your Python with wide-unicode support.

By default, Python 2 is built with narrow (UCS2) unicode support only. When building from source, enable wide support with:

./configure --enable-unicode=ucs4

You can verify what configuration was used by testing sys.maxunicode:

import sys
if sys.maxunicode == 0x10FFFF:
    print 'Python built with UCS4 (wide unicode) support'
else:
    print 'Python built with UCS2 (narrow unicode) support'

A wide build stores every unicode value as UCS4, using 4 bytes per character and doubling memory usage compared with a narrow build. Python 3.3 switched to a flexible representation (PEP 393): each string uses 1, 2, or 4 bytes per character, whichever is just wide enough for the largest code point in that string.
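To make the memory trade-off concrete, here is a rough sketch that estimates how many bytes one extra character adds to a string. The helper name bytes_per_char is made up for this illustration, and the measurement relies on sys.getsizeof, so it reflects CPython internals rather than any language guarantee. On CPython 3.3 or later the four samples should come out as 1, 1, 2, and 4; a Python 2 wide build would report 4 for every one of them.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate per-character storage by growing a string by one character.

    Hypothetical helper for illustration; relies on CPython's sys.getsizeof.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = [
    ("ASCII", u"a"),
    ("Latin-1", u"\xe9"),
    ("BMP", u"\u20ac"),
    ("astral", u"\U0002f920"),
]

for label, ch in samples:
    # The width is chosen per string, based on its largest code point.
    print("%-8s %d byte(s) per character" % (label, bytes_per_char(ch)))
```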

Quick demo showing that a wide build handles your sample Unicode string correctly:

$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']
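
If rebuilding Python is not an option, a narrow build can still be made to count code points by joining UTF-16 surrogate pairs by hand. The following is only a sketch of that workaround, not something built into Python 2; the function name code_points is made up for this example, and it returns integers because a narrow build cannot hold a code point above U+FFFF in a single character.

```python
def code_points(text):
    """Return the code points in text as integers, joining surrogate pairs.

    Hypothetical helper for illustration; lone surrogates are passed
    through unchanged.
    """
    points = []
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes a single
        # code point above U+FFFF.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
        points.append(unit)
        i += 1
    return points


# The four code units from the question, written out explicitly.
sample = u"\ud87e\udd20\ud87e\udd21"
print([hex(p) for p in code_points(sample)])
print(len(code_points(sample)))
```

With the sample from the question this yields two code points, 0x2f920 and 0x2f921, so len(code_points(...)) gives the count that plain len reports only on a wide build.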
Martijn Pieters
  • Which encoding does 3.3 use? – David Heffernan Oct 15 '13 at 18:43
  • @DavidHeffernan: See [PEP 393](http://docs.python.org/3/whatsnew/3.3.html#pep-393); up to UCS4, dropping down to UCS2 if the 2 most significant bytes are 0 for all characters, down to Latin-1 if the remaining most significant byte is 0 for all characters as well. – Martijn Pieters Oct 15 '13 at 18:45
  • sys.maxunicode by build: OS X bundled 2.5, 2.6, 2.7: 0xffff; python.org 2.6.5: 0xffff; python.org 3.3: 0x10ffff; pypy-2.0.2: 0x10ffff – Dima Tisnek Oct 15 '13 at 20:25
  • @qarma: Python 3.3 did away with narrow vs. wide altogether, so `sys.maxunicode` is hardcoded to 0x10ffff there. Any Python 2 versions bundled with OS X releases are all narrow (but modern macOS releases use Python 3). – Martijn Pieters Oct 15 '13 at 20:58