Since Python 2.2 and PEP 261, Python can be built in "narrow" or "wide" mode, which affects the definition of a "character", i.e. "the addressable unit of a Python Unicode string".

Characters in narrow builds look like UTF-16 code units:

>>> a = u'\N{MAHJONG TILE GREEN DRAGON}'
>>> a
u'\U0001f005'
>>> len(a)
2
>>> a[0], a[1]
(u'\ud83c', u'\udc05')
>>> [hex(ord(c)) for c in a.encode('utf-16be')]
['0xd8', '0x3c', '0xdc', '0x5']

(The above seems to disagree with some sources that insist that narrow builds use UCS-2, not UTF-16. Very intriguing indeed.)
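For reference, the two values seen in `a[0], a[1]` are exactly what the standard UTF-16 surrogate-pair arithmetic gives for U+1F005. The following is only a minimal sketch of that arithmetic; the helper name `utf16_code_units` is made up for illustration and is not part of Python.

```python
def utf16_code_units(code_point):
    """Split a Unicode code point into its UTF-16 code units (hypothetical helper)."""
    if code_point < 0x10000:
        # BMP code points fit in a single 16-bit unit.
        return [code_point]
    # Non-BMP: subtract 0x10000, then split the remaining 20 bits in half.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return [high, low]


units = utf16_code_units(0x1F005)
print([hex(u) for u in units])  # ['0xd83c', '0xdc05']
```

These match the bytes produced by `encode('utf-16be')` in the session above.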

Does Python 3.0 keep this distinction? Or are all Python 3 builds wide?

(I've heard about PEP 393 that changes internal representation of strings in 3.3, but this doesn't relate to 3.0 ~ 3.2.)

Kos

1 Answer

Yes, Python 3.0 through 3.2 keep the distinction. Windows uses narrow builds, while (most) Unix systems use wide builds.

Using Python 3.2 on Windows:

>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
2
>>> a
'🀅'

On 3.3+ on Windows, by contrast, this is the expected behavior:

>>> a = '\N{MAHJONG TILE GREEN DRAGON}'
>>> len(a)
1
>>> a
'\U0001f005'
>>> print(a)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print(a)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f005' in position 0: Non-BMP character not supported in Tk

The UCS-2 codec is used on Tk (I'm using IDLE - the terminal may show another error).
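If you'd rather not rely on `len()` of a test string, `sys.maxunicode` reports the largest code point a single string element can hold: `0xFFFF` on narrow builds, `0x10FFFF` on wide builds and on every build from 3.3 onwards. A small sketch (the function name `describe_build` and its return strings are just for illustration):

```python
import sys


def describe_build():
    """Classify the running interpreter by sys.maxunicode (illustrative helper)."""
    if sys.maxunicode == 0xFFFF:
        # Narrow build: non-BMP characters are stored as surrogate pairs.
        return "narrow"
    # 0x10FFFF: a wide build (before 3.3) or any 3.3+ build (PEP 393).
    return "wide or 3.3+"


print(hex(sys.maxunicode), describe_build())
print(len('\U0001f005'))  # 2 on a narrow build, 1 otherwise
```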

JBernardo
  • I ran a couple of tests on which platforms use which build. Narrow: Windows, Mac, NetBSD, OpenBSD, Solaris. Wide: Linux, FreeBSD. Tested with Python 2.7. So I can't say most Unix* use wide builds. – JonnyJD Jun 16 '13 at 14:13
  • @jonnyJD: For Python 2.7, correct. The statement above was for 3.0 to 3.2. – Andreas Maier Apr 17 '20 at 06:39