
I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:

text = "abc\u20ac\U00010302\U0010fffd"  # named "text" to avoid shadowing the built-in str
for ch in text:
    code = ord(ch)
    print("U+{:04X}".format(code))

That prints:

U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

when what I wanted was:

U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3; I don't care about Python 2.x.

The best I've been able to come up with so far is this:

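import codecs
text = "abc\u20ac\U00010302\U0010fffd"
# Encode to big-endian UTF-32: exactly 4 bytes per code point, no BOM
bytestr, _ = codecs.getencoder("utf_32_be")(text)
for i in range(0, len(bytestr), 4):
    code = 0
    # Reassemble each 4-byte group into one integer
    for b in bytestr[i:i + 4]:
        code = (code << 8) + b
    print("U+{:04X}".format(code))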

But I'm hoping there's a simpler way.

(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here, please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)

agf
Ross Smith
  • Best I can do (on Python 2, narrow build like you) is `string.encode('utf-32-be')` then `for chars in (string[n:n+4] for n in range(0, len(string), 4)):` then `code = reduce(lambda x, y: (x << 8) + y, (ord(ch) for ch in chars))` – agf Sep 21 '11 at 03:26
  • I consider myself a pedantic nitpicker over precise Unicode terminology and think you've made yourself perfectly clear ;-) – Joachim Sauer Sep 21 '11 at 05:40
  • [`sys.maxunicode`](http://docs.python.org/library/sys.html#sys.maxunicode) is "An integer giving the largest supported code point for a Unicode character." Maybe unicode string iteration isn't supported for non-BMP characters if you're using a UTF-16 version of Python. I've asked this question at http://stackoverflow.com/questions/7495150/what-does-sys-maxunicode-mean. – Peter Graham Sep 21 '11 at 05:52

3 Answers


On Python 3.2.1 with narrow Unicode build:

PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535

What you've discovered (UTF-16 encoding):

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

A way around it:

>>> import struct
>>> s=s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4),s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4),s):
...     print('U+{:04X}'.format(i))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

Update for Python 3.3:

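Now it works the way the OP expects, because Python 3.3 (PEP 393) dropped the narrow/wide build distinction and strings are always indexed by code point: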

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
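
For reuse, the UTF-32 workaround above could be wrapped in a small helper. This is only a sketch: the name `code_points` is made up for illustration, and it assumes the string contains no lone surrogates (encoding those to UTF-32 may fail on newer Python versions).

```python
import struct

def code_points(text):
    """Return the code points of text as a tuple of ints (hypothetical helper)."""
    # utf-32-be yields exactly 4 big-endian bytes per code point, with no BOM
    data = text.encode("utf-32-be")
    # Unpack all 4-byte groups at once as big-endian unsigned 32-bit integers
    return struct.unpack(">{}L".format(len(data) // 4), data)

for cp in code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```

Because the conversion goes through an explicit encoding, the result should not depend on whether the interpreter is a narrow or a wide build.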
Mark Tolonen
  • Thanks for the struct.unpack() trick; I didn't know you could do that and it's much shorter than my code. I think it's pretty clear by now that this is the best solution I'm going to get; apparently Python just doesn't support UTF-32 natively (outside of custom 32-bit builds). – Ross Smith Sep 21 '11 at 20:10

Default Python builds store Unicode values internally as UCS2. The UTF-16 representation of the character U+10302 (written \U00010302 in source) is the surrogate pair \uD800\uDF02, which is why you got that result.

That said, there are some Python builds that use UCS4, but the two kinds of build are not compatible with each other.

Take a look here.

Py_UNICODE This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).
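
As a rough illustration of the arithmetic behind that surrogate pair, here is a small sketch. The function names `to_surrogates` and `from_surrogates` are invented for this example and are not part of any Python API.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10302)
print("U+{:04X} U+{:04X}".format(high, low))  # U+D800 U+DF02
```

The same calculation gives U+DBFF U+DFFD for U+10FFFD, matching the output in the question.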

pablosaraiva
  • Yes, I know all that. I thought I made it clear in the original post that I understand perfectly well why I'm seeing what I'm seeing. I'm looking for a way to get UTF-32, not yet another explanation of how UTF-16 works. – Ross Smith Sep 21 '11 at 03:47
  • I'm sorry Ross. I understood from your question that you know how UTF-16 and UTF-32 work. And by your reputation I can see you're not naive. The point is that once your Python build is using UCS2 encoding, that's what you get when you ask for the code points. If you were using the UCS4 Python build, you would get the other. So I think what you really want here is to find out how to convert UTF-16 into UTF-32, regardless of how it originally was in the source code. – pablosaraiva Sep 21 '11 at 04:04
  • Yes, exactly. Well, to be exact, how to convert a Python 3 string to UTF-32, regardless of which encoding (UTF-16 or 32) the Python build it happens to be running on uses internally. – Ross Smith Sep 21 '11 at 04:45
  • @RossSmith: maybe adding that last part ("regardless of how Python 3 was built!") to your question would help, because it's easy to miss that part ... – Joachim Sauer Sep 21 '11 at 05:44

If you create the string as a unicode object, it should be able to break off a character at a time automatically. E.g.:

Python 2.6:

s = u"abc\u20ac\U00010302\U0010fffd"   # note u in front!
for c in s:
    print "U+%04x" % ord(c)

I received:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Python 3.2:

s = "abc\u20ac\U00010302\U0010fffd"
for c in s:
    print("U+%04x" % ord(c))

It worked for me:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Additionally, I found this link, which explains that this behavior is correct. If the string came from a file, etc., it will likely need to be decoded first.

Update:

I've found an insightful explanation here. The internal Unicode representation size is a compile-time option, and if you are working with "wide" characters outside the 16-bit plane you'll need to build Python yourself to remove the limitation, or use one of the workarounds on this page. Apparently many Linux distros do this for you already, which is what I encountered above.
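
One more possible workaround, sketched here under the assumption that you would rather not depend on how Python was built: walk the string and merge surrogate pairs by hand. The name `iter_code_points` is made up for this example. On a wide build the surrogate branch simply never triggers for ordinary text.

```python
def iter_code_points(s):
    """Yield code points from s, merging UTF-16 surrogate pairs if present."""
    i, n = 0, len(s)
    while i < n:
        cp = ord(s[i])
        i += 1
        # A high surrogate followed by a low surrogate encodes one code point
        if 0xD800 <= cp <= 0xDBFF and i < n:
            low = ord(s[i])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1
        # Unpaired surrogates are passed through unchanged
        yield cp

for cp in iter_code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```

Checking `sys.maxunicode` (65535 on a narrow build, 1114111 on a wide one) tells you which kind of interpreter you are running, but the generator above should give the same sequence either way.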

Community
Gringo Suave
  • Didn't get that with Python 3.2.1. What is the value of `sys.maxunicode` on your system? Perhaps you have a wide Unicode build? – Mark Tolonen Sep 21 '11 at 05:52
  • Gringo, you apparently have a Python build that uses UTF-32 internally. As pablosaraiva pointed out, this is not the default and can't be relied on in portable code. – Ross Smith Sep 21 '11 at 20:12
  • Interesting: `import sys; sys.maxunicode` gives `1114111`. I'm using the Pythons packaged by Ubuntu Natty (Debian). I suspect your setup is more "custom" ... Windows has always preferred the less common 16-bit variants. I would have expected Python to handle details such as this transparently, like it does int/long (for example). – Gringo Suave Sep 21 '11 at 21:57
  • I'm using the standard Windows distribution, directly from the official Python site. (I just tried it with the official Mac build, and it uses UTF-16 too.) I also expected this to be handled transparently; apparently Guido disagrees, unfortunately. – Ross Smith Sep 22 '11 at 01:41
  • Yes (though UCS, not UTF); you learn something new every day. I've linked to an answer with a great explanation. Perhaps they thought UCS-4 wastes too much space. – Gringo Suave Sep 22 '11 at 02:05