supplemental codepoints to unicode string in python

Question

unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.

Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?

I’m pretty sure that this can’t be done, and that it is one of the reasons you can’t trust somebody else’s Python to run on arbitrary Unicode data. However, this seems to be fixed in the v3.3 release. If you want abstract Unicode, you have to wait for the next release, or use a more robust platform. — tchrist, Feb 15 '12 at 00:42
@tchrist, Thanks. Yeah. I need to learn Python3.x. It seems to fix a lot of little sources of confusion. — Mike Samuel, Feb 15 '12 at 11:31
I (mostly) disagree with @tchrist that it can't be done; see my answer below where I do it. — Jim DeLaHunt, Nov 18 '12 at 00:15

Jim DeLaHunt · Accepted Answer · 2012-11-18T08:21:02.693

Yes, here you go:

>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'

The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, section 2, Built-in Functions, says of unichr():

Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.

I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.

I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, section 2, Built-in Functions, says of chr():

Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).

Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".
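As a quick illustration of the Python 3 behavior described above, here is a minimal sketch. It assumes Python 3.3 or later, where a supplementary code point occupies one element of the string; the variable name s is only for illustration.

```python
# Assumes Python 3.3+, where str no longer has narrow and wide builds.
s = chr(0x10000)

print(len(s))                  # 1: one code point, one string element
print(hex(ord(s)))             # 0x10000
print(s.encode('utf-16-be'))   # b'\xd8\x00\xdc\x00': surrogates appear only when encoding
```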

But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.
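One way to find out which implementation the code is running on is sys.maxunicode, which is 0xFFFF on a narrow (UCS2) build and 0x10FFFF on a wide (UCS4) build and on Python 3.3+. A minimal sketch; the name is_narrow_build is hypothetical:

```python
import sys

# sys.maxunicode is the largest code point a single string element can hold.
is_narrow_build = (sys.maxunicode == 0xFFFF)

if is_narrow_build:
    print("narrow build: code points above 0xFFFF are stored as surrogate pairs")
else:
    print("wide build or Python 3.3+: every code point is one string element")
```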

You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:

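def unichr_supplemental(scalar):
    try:
        # Succeeds for any scalar on a wide build, and for the BMP on a narrow build.
        return unichr(scalar)
    except ValueError:
        # Narrow build: build the UTF-16 surrogate pair (high unit, then low unit).
        return unichr(0xd800 + ((scalar - 0x10000) // 0x400)) \
             + unichr(0xdc00 + ((scalar - 0x10000) % 0x400))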

>>> unichr_supplemental(0x41),len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
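The surrogate-pair arithmetic can also be checked on its own, independent of how the interpreter stores strings. The sketch below works on plain integers; surrogate_pair is a hypothetical helper name, not a standard library function, and the range check is an added assumption about what input should be accepted.

```python
def surrogate_pair(scalar):
    """Return the (high, low) UTF-16 code units for a supplementary scalar value."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % scalar)
    offset = scalar - 0x10000            # a 20-bit value
    high = 0xD800 + (offset // 0x400)    # top 10 bits
    low = 0xDC00 + (offset % 0x400)      # bottom 10 bits
    return high, low

print([hex(u) for u in surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in surrogate_pair(0x1F600)])   # ['0xd83d', '0xde00']
```

Keeping the arithmetic in integers makes it easy to test on any interpreter, narrow or wide.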

But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:

>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
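To apply the UTF-32 approach to an arbitrary integer rather than a hand-written byte string, one option is to pack the scalar with the standard struct module. This is only a sketch: unichr_via_utf32 is a hypothetical name, and it assumes the utf-32be codec behaves on your interpreter as in the session above.

```python
import struct

def unichr_via_utf32(scalar):
    """Build a Unicode string from a scalar value by decoding its UTF-32BE bytes."""
    # '>I' packs the integer as a 4-byte big-endian unsigned value.
    return struct.pack('>I', scalar).decode('utf-32be')

print(repr(unichr_via_utf32(0x41)))
# Length is 2 on a narrow build, 1 on a wide build or Python 3.3+.
print(len(unichr_via_utf32(0x10000)))
```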

The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.

Good answer. Note that the most recent Python release got rid of the whole "wide build" issue, which also helps these things a great deal. If you are running an earlier release, you should certainly use a "wide build". — tchrist, Nov 18 '12 at 00:54
You are correct about 2.x. Thanks for the pointers to the specs and the explanation of the differences between them. — Mike Samuel, Nov 18 '12 at 08:18
