Why do Unicode strings have a different memory footprint in Python 2 and 3?

Question

In Python 2, an empty string occupies exactly 37 bytes:

>>> print sys.getsizeof('')
37

In Python 3.6, the same call outputs 49 bytes:

>>> print(sys.getsizeof(''))
49

I thought this was because all strings are Unicode in Python 3. But, to my surprise, here are some confusing outputs:

Python 2.7

>>> print sys.getsizeof(u'')
52
>>> print sys.getsizeof(u'1')
56

Python 3.6

>>> print(sys.getsizeof(''))
49
>>> print(sys.getsizeof('1'))
50

An empty Unicode string is not the same size in the two versions.
Adding a character costs 4 additional bytes in Python 2 but only 1 in Python 3.

Why is the memory footprint different between the two versions?

EDIT

I specified the exact version of my Python environment because the numbers differ between Python 3 builds.

https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python — Josh Lee, Aug 07 '18 at 17:52

score 4 · Accepted Answer · answered Aug 07 '18 at 17:51


There are reasons, of course, but it should not matter for any practical purpose. If you have a Python system that has to keep so many strings in memory that it gets close to the system's memory limit, you should optimize it by (1) loading or creating the strings lazily, or (2) using an efficient byte-oriented binary structure for your data, such as those provided by NumPy, or Python's own bytearray.

The change in size for the empty string literal (the unicode literal in Python 2) could be due to any implementation detail that differs between the versions you are looking at. That should not matter even if you were writing C code to interact directly with Python strings: even such code should only touch the strings via the API.

Now, the specific reason why a string in Python 3 grows by just 1 byte per added character, while in Python 2 it grows by 4 bytes, is PEP 393.

Before Python 3.3, any (unicode) string in Python used either a fixed 2 bytes or a fixed 4 bytes of memory for each character, and the Python interpreter and any Python modules using native code had to be compiled for just one of these widths. That is, you could effectively end up with incompatible Python binaries, even if the versions matched, because of the string-width option picked at build time. The two kinds of builds were known as "narrow build" and "wide build". With the above-mentioned PEP 393, the character size of a Python string is determined when the string is instantiated, depending on the widest Unicode code point it contains. Strings whose code points all fall within the first 256 code points (equivalent to the Latin-1 character set) use only 1 byte per character.
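As a rough illustration of the per-character width described above, the sketch below measures how much `sys.getsizeof` grows when one more copy of a character is appended. The helper name `bytes_per_code_point` and the sample characters are only illustrative, and the result assumes CPython 3.3 or later; other implementations may report something different.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Size difference between a string of n+1 and n copies of `ch`."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xff"),
    ("BMP, above Latin-1", "\u0100"),
    ("outside the BMP", "\U00010000"),
]

for label, ch in samples:
    # On CPython 3.3+ the growth is expected to be 1, 1, 2 and 4 bytes.
    print(label, bytes_per_code_point(ch))
```

Comparing two lengths of the same string cancels out the fixed object header, which is why the absolute sizes reported in the question do not matter here.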

answered Aug 07 '18 at 17:51

jsbueno


I agree with you, and quite frankly I was just playing with strings when I found that. There is no particular use case behind my question; it was out of pure curiosity. But I like to think that there must be a reason if they actually put effort into implementing an encoding system that uses 1 byte instead of a fixed 2 or 4. I mean, it implies that it is somewhat relevant, or am I wrong? – scharette Aug 07 '18 at 17:57
@scharette It's an optimization that benefits a lot of programs without the programmers even having to know about it. Just like you don't have to understand how PyPy's JIT works for your code to go faster, you don't have to understand how CPython 3.7's Unicode storage works for your code to take less memory. It's documented for the rare cases where you _do_ need to know, but most people never do. (In fact, I suspect more people know it out of curiosity, or from hacking on CPython itself, than because they needed to learn it for a Python program.) – abarnert Aug 07 '18 at 18:14
It's true that, even when using C code, you should only touch the strings via the API. But, unfortunately, 3.3 had to change the C API, deprecating most of the old functions, to make the optimization work efficiently and smoothly, and the changes refer you to PEP 393, so it's not quite as invisible as it ideally should be. – abarnert Aug 07 '18 at 18:16
@abarnert True, I just like to think that those questions deserve their place on SO since they can sometimes lead to a better understanding of a complex concept. – scharette Aug 07 '18 at 18:21
@scharette Sure, that's why I write stuff like [this horrible module](https://github.com/abarnert/superhackyinternals) and answer questions about it. I certainly _hope_ nobody's actually using it for anything other than understanding CPython better. – abarnert Aug 07 '18 at 18:24
OK, sorry if my answer reads as if one should not be playing around and asking things. Of course tinkering and understanding are quite important. When I drafted the answer, I was afraid the question came from you trying to sort out this difference for production code. – jsbueno Aug 07 '18 at 18:51

Dietrich Epp · Answer 2 · 2018-08-07T17:51:49.313

3

Internally, Python 3 now stores strings in one of four different encodings, and chooses the encoding separately for each string. These encodings are ASCII, Latin-1, UCS-2, and UTF-32. Each one can represent a different subset of Unicode characters, and each has the useful property that the element at index i is also the Unicode code point at index i.

In [1]: import sys

In [2]: sys.getsizeof('\xFF')
Out[2]: 74

In [3]: sys.getsizeof('X\xFF')
Out[3]: 75

In [4]: sys.getsizeof('\u0100')
Out[4]: 76

In [5]: sys.getsizeof('X\u0100')
Out[5]: 78

In [6]: sys.getsizeof('\U00010000')
Out[6]: 80

In [7]: sys.getsizeof('X\U00010000')
Out[7]: 84

You can see that adding one more character, in this case 'X', makes the string take up 1, 2, or 4 additional bytes, depending on the widest code point contained in the rest of the string.

This system was proposed in PEP 393 and implemented in Python 3.3. Earlier versions of Python use the older unicode representation, which always used 2 or 4 bytes per... element (I hesitate to say "character"), depending on compile-time options, and the two could not be mixed.
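To make the per-string choice concrete, here is a minimal sketch that predicts the element width from the widest code point and prints it next to the measured size. `expected_width` is a hypothetical helper written for this illustration, not a CPython API, and the thresholds simply follow the 1, 2, and 4 byte cases described above. The measured sizes assume CPython 3.3 or later and will vary between builds.

```python
import sys

def expected_width(s):
    """Bytes per element suggested by the widest code point in `s`."""
    widest = max(map(ord, s), default=0)
    if widest < 0x100:       # fits in Latin-1 (ASCII included)
        return 1
    if widest < 0x10000:     # fits in the Basic Multilingual Plane
        return 2
    return 4                 # needs the full code point range

base = "a" * 1000
for extra in ["b", "\u0100", "\U00010000"]:
    s = base + extra
    # A single wide code point widens every element of the string.
    print(hex(ord(extra)), expected_width(s), sys.getsizeof(s))
```

The point of the comparison is that one wide character appended to a long ASCII string changes the storage of the whole string, not only of the new character.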

edited Aug 07 '18 at 17:51

answered Aug 07 '18 at 17:44

Dietrich Epp


It's probably worth mentioning that, by comparison, Python 2 (and early Python 3) stores `unicode` strings in UTF-32. (Or, if you use a "narrow build", it stores them in UTF-16 and acts as if surrogates were separate characters.) – abarnert Aug 07 '18 at 17:47
@abarnert: Earlier versions of Python 3 also do this. – Dietrich Epp Aug 07 '18 at 17:48
Yes, that's why I said "(and early Python 3)". – abarnert Aug 07 '18 at 18:12
Damn my eyes! Nevertheless that information is already included in the answer. – Dietrich Epp Aug 07 '18 at 18:20
By the way, the core devs also hesitated to say "character". Whenever they added Unicode 2.0 support (somewhere around Python 2.1-ish?) and introduced narrow builds, they stripped the word "character" out of the C API docs and made sure to explicitly call them things like "`Py_UNICODE` values" instead. (But, oddly, they still refer to UCS2 instead of UTF-16 all the way up to 3.2.) – abarnert Aug 07 '18 at 18:21
