5

On 64-bit Debian Linux 6:

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
9223372036854775807
>>> sys.maxunicode
1114111

On 64-bit Windows 7:

Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
2147483647
>>> sys.maxunicode
65535

Both operating systems are 64-bit, yet they report different values for sys.maxunicode. According to Wikipedia, there are 1,114,112 code points in Unicode. Is sys.maxunicode on Windows wrong?

And why do they have different sys.maxint?

Tyler Liu

2 Answers

4

I don't know what your question is, but sys.maxunicode is not wrong on Windows.

See the docs:

sys.maxunicode

An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that specifies whether Unicode characters are stored as UCS-2 or UCS-4.

Python on Windows uses UCS-2, so the largest code point that fits in a single storage unit is 65,535 (0xFFFF). Supplementary-plane characters are encoded as two 16-bit code units, known as "surrogate pairs".
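The surrogate-pair arithmetic can be sketched as follows. This is a minimal illustration of the standard UTF-16 rule, not Python's internal code; the helper name `to_surrogate_pair` is hypothetical and does not exist in the standard library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into two 16-bit code units.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x10120)
print("%#x %#x" % (high, low))  # 0xd800 0xdd20
```

Both halves fall in the range 0xD800-0xDFFF, which is reserved for surrogates, so each one fits below a sys.maxunicode of 65,535.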

As for sys.maxint, it marks the point at which Python 2 switches from "simple integers" (123) to "long integers" (12345678987654321L). Your Python for Windows uses 32 bits for simple integers, and your Python for Linux uses 64 bits. Since Python 3, this has become irrelevant because the simple and long integer types have been merged into one, so sys.maxint is gone from Python 3.
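A small sketch of that switch, written so that it runs on both Python 2 and Python 3. The variable names are illustrative, and the `getattr` fallback is only there because sys.maxint does not exist on Python 3.

```python
import sys

# sys.maxint exists only on Python 2; getattr avoids an AttributeError on Python 3.
maxint = getattr(sys, "maxint", None)

if maxint is not None:
    # Python 2: one past sys.maxint is silently promoted to the long type.
    promoted = type(maxint + 1).__name__
else:
    # Python 3: there is a single int type with arbitrary precision.
    promoted = type(2 ** 64).__name__

print(promoted)  # 'long' on Python 2, 'int' on Python 3
```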

Eric O. Lebigot
Tim Pietzcker
  • 3
    I would also add that `sys.maxunicode` has no relation whatsoever with `sys.maxint`. – Adrien Plisson Nov 17 '11 at 09:35
  • 4
    As I understand it, "surrogate pairs" apply only to UTF-16; UCS-2 is simply incapable of representing characters past 65535. – Keith Thompson Nov 17 '11 at 09:53
  • Why does python for windows use 32 bit while my OS is 64 bit? And why is it not the same on Linux? – Tyler Liu Nov 17 '11 at 09:55
  • 2
    @TimPietzcker: I would like to add a pointer to the documentation about supplementary character planes: "Any Unicode character can be encoded [with \Uxxxxxxxx], but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Individual code units which form parts of a surrogate pair can be encoded using this escape sequence." (http://docs.python.org/reference/lexical_analysis.html#string-literals). – Eric O. Lebigot Nov 17 '11 at 10:11
  • 2
    @KeithThompson: it looks like Python can encode characters outside of the Basic Multilingual Plane (BMP) even when it has `sys.maxunicode==65535`: `print repr(u"\U00010120")` correctly returns the original input string representation. So, it looks like Python is using UCS-2 internally, with a convention that allows it to represent characters outside of the BMP. In fact, if you look at the internal representation with `u"\U00010120".encode('unicode_internal').encode('hex')`, you see that Python uses the special code `0xd800`, which is guaranteed not to point to any character (like d800-dfff). – Eric O. Lebigot Nov 17 '11 at 10:29
  • 1
    Is UCS-2 "with a convention that allows it to represent characters outside the BMP" just a way to describe UTF-16, or does Python's convention differ from UTF-16? – Keith Thompson Nov 17 '11 at 20:09
  • 1
    @EOL: So technically it's "largest supported code *unit*" instead of "largest supported code *point*". – dan04 Nov 18 '11 at 14:01
  • @dan04: Yeah, that's my understanding, but I'm no expert. :) – Eric O. Lebigot Nov 18 '11 at 16:42
1

Regarding the difference in sys.maxint, see "What is the bit size of long on 64-bit Windows?". On Python 2.x, Python uses the C long type internally to store a small integer.
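One way to check this on a given machine is to ask `ctypes` for the size of a C long. This is a sketch under the assumption that the interpreter was built with the platform's usual C compiler; the variable names are illustrative. A 64-bit Windows build typically has a 32-bit long (LLP64 data model), while a 64-bit Linux build has a 64-bit long (LP64).

```python
import ctypes

# Width of the platform's C long, in bits.
long_bits = ctypes.sizeof(ctypes.c_long) * 8

# Largest value a signed C long can hold; on Python 2 this should
# match sys.maxint, since small integers are stored in a C long.
max_c_long = 2 ** (long_bits - 1) - 1

print("%d %d" % (long_bits, max_c_long))
```

With a 32-bit long this gives 2147483647, and with a 64-bit long 9223372036854775807, the two values shown in the question.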

Community
casevh