How to find out if Python is compiled with UCS-2 or UCS-4?

Question

Just what the title says.

$ ./configure --help | grep -i ucs
  --enable-unicode[=ucs[24]]

Searching the official documentation, I found this:

sys.maxunicode: An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that specifies whether Unicode characters are stored as UCS-2 or UCS-4.

What is not clear is which value corresponds to UCS-2 and which to UCS-4.

The code is expected to work on Python 2.6+.

score 129 · Accepted Answer · answered Sep 18 '09 at 19:33

When built with --enable-unicode=ucs4:

>>> import sys
>>> print sys.maxunicode
1114111

When built with --enable-unicode=ucs2:

>>> import sys
>>> print sys.maxunicode
65535
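
The two values above can be wrapped in a small check. This is only a minimal sketch: the helper name `classify_maxunicode` is hypothetical, and on Python 3.3+ `sys.maxunicode` is always 1114111, so the result there does not reflect a build option.

```python
import sys

def classify_maxunicode(maxunicode=None):
    """Map a sys.maxunicode value to the build type it implies (hypothetical helper)."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF (65535) means a narrow (UCS-2) build;
    # 0x10FFFF (1114111) means a wide (UCS-4) build.
    if maxunicode > 0xFFFF:
        return "UCS-4"
    return "UCS-2"

print(classify_maxunicode())
```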

Stef

This is not universally correct anymore for Python 3. See https://docs.python.org/3.4/c-api/unicode.html: `Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations`. https://www.python.org/dev/peps/pep-0393/ – Dr. Jan-Philip Gehrcke Oct 12 '15 at 09:40

@Jan-PhilipGehrcke: `deficient_unicode_build = (sys.maxunicode < 0x10ffff)` works on any Python version (even if the flexible internal representation is used, where `sys.maxunicode == 0x10ffff`). The flexible representation gives correct results, as ucs4 did on previous versions, while using less memory than ucs4 in some cases. – jfs Mar 05 '16 at 19:11
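
One visible consequence of a narrow build is that a code point above U+FFFF is stored as a surrogate pair, so `len()` reports 2 for it. A small sketch of that check, reusing the `deficient_unicode_build` name from the comment above (the variable names `astral` and `units` are made up for illustration):

```python
import sys

# A single code point outside the Basic Multilingual Plane (U+10000).
astral = u"\U00010000"

# Narrow (UCS-2) builds store it as two surrogates, so len() is 2.
# Wide (UCS-4) builds and Python 3.3+ report 1.
units = len(astral)
deficient_unicode_build = sys.maxunicode < 0x10FFFF

print(units, deficient_unicode_build)
```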

score 20 · Answer 2 · answered Sep 18 '09 at 19:20

It's 0xFFFF (or 65535) for UCS-2, and 0x10FFFF (or 1114111) for UCS-4:

Py_UNICODE
PyUnicode_GetMax(void)
{
#ifdef Py_UNICODE_WIDE
    return 0x10FFFF;
#else
    /* This is actually an illegal character, so it should
       not be passed to unichr. */
    return 0xFFFF;
#endif
}

The maximum character in UCS-4 mode is defined by the maximum value representable in UTF-16.
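
The arithmetic behind that limit can be checked directly. UTF-16 encodes code points above U+FFFF as a pair of surrogates that each carry 10 bits, offset by 0x10000. A short sketch (the constant names are chosen here for illustration):

```python
# Start of the high and low surrogate ranges in UTF-16.
HIGH_BASE, LOW_BASE = 0xD800, 0xDC00

# The highest possible surrogate pair.
high, low = 0xDBFF, 0xDFFF

# Each surrogate contributes 10 bits; the result is offset by 0x10000.
max_code_point = 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)

print(hex(max_code_point))  # 0x10ffff
```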

Martin v. Löwis

Note that this function is _not_ used to implement `sys.maxunicode` from Python 3.3 onward (i.e., all maintained versions of Python as of this comment). It only concerns the size of the now-deprecated `Py_UNICODE` typedef. `maxunicode` originates from `SET_SYS_FROM_STRING("maxunicode", PyLong_FromLong(0x10FFFF));`. – Eric Jan 24 '20 at 15:39

score 11 · Answer 3 · answered Sep 20 '09 at 02:50

I had this same issue once. I documented it for myself on my wiki at

http://arcoleo.org/dsawiki/Wiki.jsp?page=Python%20UTF%20-%20UCS2%20or%20UCS4

I wrote -

import sys
sys.maxunicode > 65536 and 'UCS4' or 'UCS2'

Dave

For anyone wondering what this does: it is an old (< Python 2.5) way of doing `'UCS4' if sys.maxunicode > 65536 else 'UCS2'`. – vaultah Aug 07 '16 at 15:00

score 10 · Answer 4 · answered Mar 04 '16 at 16:40

The sysconfig module reports the Unicode size from Python's build configuration variables.

The build flags can be queried like this:

Python 2.7:

import sysconfig
sysconfig.get_config_var('Py_UNICODE_SIZE')

Python 2.6:

import distutils.sysconfig  # a plain "import distutils" does not load the submodule
distutils.sysconfig.get_config_var('Py_UNICODE_SIZE')
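
The two variants can be folded into one snippet that runs on either version. This is a sketch, not a guaranteed recipe: on the Python 2 builds discussed here the value should be 2 or 4, but it may be `None` on interpreters that do not record `Py_UNICODE_SIZE`.

```python
try:
    import sysconfig  # standard library from Python 2.7 / 3.2
except ImportError:
    from distutils import sysconfig  # fallback for Python 2.6

# Expected: 2 on a UCS-2 build, 4 on a UCS-4 build,
# or None when the variable is not recorded.
unicode_size = sysconfig.get_config_var('Py_UNICODE_SIZE')
print(unicode_size)
```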

score 1 · Answer 5 · edited Aug 17 '16 at 07:55

I had the same issue and found a semi-official piece of code that does exactly that and may be interesting for people with the same issue: https://bitbucket.org/pypa/wheel/src/cf4e2d98ecb1f168c50a6de496959b4a10c6b122/wheel/pep425tags.py?at=default&fileviewer=file-view-default#pep425tags.py-83:89.

It comes from the wheel project, which needs to know whether Python was compiled with UCS-2 or UCS-4 because that changes the name of the generated binary file.
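
For readers who cannot follow the link, here is a sketch of one way such a check could be written. It is not copied from the wheel project. The helper `is_wide_build` is hypothetical and simply assumes the approach of preferring the recorded `Py_UNICODE_SIZE` and falling back to `sys.maxunicode` when it is missing.

```python
import sys

try:
    import sysconfig
except ImportError:
    from distutils import sysconfig  # Python 2.6 fallback

def is_wide_build(config_size, maxunicode):
    """Hypothetical helper: True for a UCS-4 (wide) build."""
    if config_size is not None:
        # The build configuration recorded the size explicitly.
        return config_size == 4
    # Otherwise infer it from the largest supported code point.
    return maxunicode == 0x10FFFF

wide = is_wide_build(sysconfig.get_config_var('Py_UNICODE_SIZE'),
                     sys.maxunicode)
print(wide)
```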

user6758673 · Answer 6 · 2016-09-07T11:43:04.527

Another way is to create a Unicode array and look at the itemsize:

import array
bytes_per_char = array.array('u').itemsize

Quote from the array docs:

The 'u' typecode corresponds to Python’s unicode character. On narrow Unicode builds this is 2-bytes, on wide builds this is 4-bytes.

Note that the distinction between narrow and wide Unicode builds is dropped from Python 3.3 onward, see PEP 393. The 'u' typecode for array is deprecated since 3.3 and scheduled for removal in Python 4.0.
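
Because the 'u' typecode is deprecated, a defensive version of the check may be useful. The following is a sketch; the helper name `unicode_itemsize` is hypothetical. On Python 3.3+ the item size follows the platform's `wchar_t` rather than a build option, so it only answers the original question on older versions.

```python
import array
import warnings

def unicode_itemsize():
    """Hypothetical helper: bytes per item of array('u'), or None if unavailable."""
    try:
        with warnings.catch_warnings():
            # The 'u' typecode is deprecated; ignore any warning it may emit.
            warnings.simplefilter("ignore")
            return array.array('u').itemsize
    except (ValueError, TypeError):
        # The typecode is not supported by this interpreter.
        return None

print(unicode_itemsize())
```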

score 0 · Answer 7 · answered Sep 18 '09 at 19:14

65535 is UCS-2:

Thus code point U+0000 is encoded as the number 0, and U+FFFF is encoded as 65535 (which is FFFF in hexadecimal).

SilentGhost
