
Why does sys.getsizeof() report a larger size for a Python str of length 1 than for a str of length 2? (For lengths greater than 2, the size seems to increase monotonically with length, as expected.)

Example:

>>> from string import ascii_lowercase
>>> import sys

>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> strings
['a',
 'ab',
 'abc',
 'abcd',
 'abcde',
 'abcdef',
 'abcdefg',
 # ...

>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58,   # <--- ??
 2: 51,
 3: 52,
 4: 53,
 5: 54,
 6: 55,
 7: 56,
 8: 57,
 9: 58,
 10: 59,
 11: 60,
 12: 61,
 13: 62,
 14: 63,
 15: 64,
 16: 65,
 # ...

It seems to have something to do with str.__sizeof__, but I don't know C well enough to dig into what's going on there.


Edit:

This appears to be related to a single Pandas import in an IPython startup file.

I can reproduce the behavior in a plain Python session also:

 ~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from string import ascii_lowercase
>>> import sys
>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 50, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> import pandas as pd
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> pd.__version__
'0.23.2'
Brad Solomon
  • What version of Python 3.x are you on? (And platform, 32- or 64-bit, python.org installer or some other thing, etc.) Because I get `50` for the first string on every 64-bit CPython 3.4, 3.6, or 3.7 that I have access to. – abarnert Jul 16 '18 at 22:00
  • I'm also getting 50. – user2357112 Jul 16 '18 at 22:00
  • Seconding abarnert, no repro on 3.6.6 or 3.7. – miradulo Jul 16 '18 at 22:00
  • [Same on repl.it](https://repl.it/repls/OpaqueFittingParallelprocessing) (although that's not surprising, as I suspect they're running the same 3.6.1 distro package as one of my two linux containers…). – abarnert Jul 16 '18 at 22:02
  • Interesting -- I'm on 3.6.6, Anaconda distro, 64bit. Let me double-check question code. – Brad Solomon Jul 16 '18 at 22:03
  • Anyway, under the covers, Python strings (in CPython 3.4+) are pretty complicated things; there are a number of different ways to store the string `'a'`, which would all take different sizes. I suppose it's possible that if you managed to construct a 58-byte `'a'` by doing something weird, then get it interned, that would be a valid answer, but the explanation would involve whatever code you did earlier in the same session to generate and intern that `'a'` in an unusual way. – abarnert Jul 16 '18 at 22:04
  • This is weird, my results are entirely deterministic and within expectation: with `thing = ['a', 'aa']`, `dict(enumerate(map(sys.getsizeof, thing), 1))` gives `{1: 50, 2: 51}`, and `sys.version` is '3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'. I am on IPython. – PoweredBy90sAi Jul 16 '18 at 22:05
  • @abarnert interestingly, I get 50 on python but 58 on IPython. – Brad Solomon Jul 16 '18 at 22:06
  • @BradSolomon: I get 50 on IPython. – user2357112 Jul 16 '18 at 22:08
  • @BradSolomon Is that a brand new IPython session, or a long-running one where you might have done something earlier that, e.g., created the string `'a'` using a C extension that’s meant to be backward compatible with 3.3? – abarnert Jul 16 '18 at 22:08
  • Brand new, re-launched and ran nothing but these 4 lines. and ipython version == 6.4.0 – Brad Solomon Jul 16 '18 at 22:10
  • If you really want to know what’s going on, you first need to dig into the struct by hacking your way underneath the C API and see exactly what’s being stored there. Which is painful; [this code](https://github.com/abarnert/superhackyinternals) can help, but you’ll need to read how the internals work before you can even use it. – abarnert Jul 16 '18 at 22:16
  • Once we know that, then we’ll probably need a C-level debugger to figure out where the string is first being constructed and trace that back to… probably some Python code in your IPython startup or something. – abarnert Jul 16 '18 at 22:18
  • Alternatively, you could bisect on your startup file to narrow it down to the one line that seems to cause it. – abarnert Jul 16 '18 at 22:19
  • That's what I'm taking a look at @abarnert – Brad Solomon Jul 16 '18 at 22:19

2 Answers


When you import pandas, it does a whole ton of NumPy stuff, including calling UNICODE_setitem on all the single-ASCII-letter strings, and presumably somewhere else doing something similar on the single-ASCII-digit strings.

That NumPy function calls the deprecated C API PyUnicode_AsUnicode.

When you call that in CPython 3.3+, it caches the wchar_t * representation on the string's internal struct, in its wstr member: for 'a', that's the two wchar_t values w'a' and w'\0', which take 8 bytes on a build with a 32-bit wchar_t. And str.__sizeof__ takes that cache into account.

So, all of the single-character interned strings for ASCII letters and digits—but nothing else—end up 8 bytes larger.
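
If that's right, you should be able to reproduce the effect without Pandas at all, just by pushing a one-character string through a NumPy unicode array. Here's a minimal sketch of that, assuming a NumPy build from the same era (one whose UNICODE_setitem still goes through PyUnicode_AsUnicode) and a 64-bit CPython with a 4-byte wchar_t:

import sys
import numpy as np

s = 'q'
print(sys.getsizeof(s))    # 50 on a fresh 64-bit CPython 3.6

arr = np.empty(1, dtype='U1')
arr[0] = s                 # UNICODE_setitem, which should call PyUnicode_AsUnicode(s)

print(sys.getsizeof(s))    # should now be 58: the interned 'q' has grown a wstr cache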


First, we know that it's apparently something that happens on import pandas (per Brad Solomon's edit to the question). It may happen on np.set_printoptions(precision=4, threshold=625, edgeitems=10) (miradulo posted, then deleted, a comment to that effect on ShadowRanger's answer), but it definitely doesn't happen on import numpy.

Second, we know that it happens to 'a', but what about other single-character strings?

To verify the former, and to test the latter, I ran this code:

import sys

strings = [chr(i) for i in (0, 10, 17, 32, 34, 47, 48, 57, 58, 64, 65, 90, 91, 96, 97, 102, 103, 122, 123, 130, 0x0222, 0x12345)]

sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import numpy as np
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

np.set_printoptions(precision=4, threshold=625, edgeitems=10)
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import pandas
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

On multiple CPython installations (but all 64-bit CPython 3.4 or later on Linux or macOS), I got the same results:

{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 58, '9': 58, ':': 50, '@': 50, 'A': 58, 'Z': 58, '[': 50, '`': 50, 'a': 58, 'f': 58, 'g': 58, 'z': 58, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}

So, neither import numpy nor set_printoptions changes anything (presumably why miradulo deleted the comment…), but import pandas does.

And it apparently affects ASCII digits and letters, but nothing else.

Also, if you change all of the prints to print(sizes.values()), so the strings never get encoded for output, you get the same results, which implies that either it's not about caching the UTF-8 representation, or, if it is, that caching happens even when we don't force an encode.


The obvious possibility is that whatever Pandas is calling is using one of the legacy PyUnicode APIs to generate single-character strings for all of the ASCII digits and letters. So these strings end up not in compact-ASCII format, but in legacy-ready format, right? (For details on what that means, see the comments in the source.)

Nope. Using the code from my superhackyinternals, we can see that it's still in compact-ASCII format:

import ctypes
import sys

from internals import PyUnicodeObject  # from the superhackyinternals repo

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))  # overlay the ctypes struct on the live object
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

import pandas

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

We can see that Pandas changes the size from 50 to 58, but the fields are still:

<__main__.PyUnicodeObject object at 0x101bbae18> 1 1 1 1 1 1

… in other words, it's 1BYTE_KIND, length 1, mortal-interned, ASCII, compact, and ready.

But, if you look at ps.wstr, before Pandas it's a null pointer, while after Pandas it's a pointer to the wchar_t string w"a\0". And str.__sizeof__ takes that wstr size into account.


So, the question is: how do you end up with a compact-ASCII string that has a wstr value?

Simple: you call PyUnicode_AsUnicode on it (or one of the other deprecated functions or macros that access the 3.2-style native wchar_t * internal storage). That native internal storage doesn't actually exist in 3.3+, so for backward compatibility those calls are handled by creating the wchar_t buffer on the fly, sticking it in the wstr member, and calling the appropriate PyUnicode_AsUCS[24] function to decode into that buffer. (Unless you're dealing with a compact string whose kind happens to match the wchar_t width, in which case wstr is just a pointer to the native storage after all.)

Ideally, you'd expect str.__sizeof__ to include that extra storage, and from the source, you can see that it does.

Let's verify that:

import ctypes
import sys

s = 'a'
print(sys.getsizeof(s))   # 50

# Call the deprecated C API directly; this forces the wstr cache to be built.
ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p
print(ctypes.pythonapi.PyUnicode_AsUnicode(s))

print(sys.getsizeof(s))   # 58

Tada, our 50 goes to 58.


So, how do you work out where this gets called?

There are actually a ton of calls to PyUnicode_AsUnicode, the PyUnicode_AS_UNICODE macro, and other functions that call them throughout Pandas and NumPy. So I ran Python under lldb and set a breakpoint on PyUnicode_AsUnicode, with a script that skips the stop if the calling stack frame is the same as last time.

The first few calls involve datetime formats. Then there's one with a single letter. And the stack frame is:

multiarray.cpython-36m-darwin.so`UNICODE_setitem + 296

… and above multiarray it's pure Python all the way up to the import pandas. So, if you want to know exactly where Pandas is calling this function, you'd need to debug in pdb, which I haven't done yet. But I think we've got enough info now.
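
If you do want to narrow down the Python-level trigger without a C debugger, here's a cruder, pure-Python sketch: it can only report the last Python line that executed before the size changed (the actual change happens inside a C call), and tracing the entire pandas import is slow. Run it in a fresh session, or the import is a no-op:

import sys

target = 'a'
baseline = sys.getsizeof(target)
last_line = None
done = False

def tracer(frame, event, arg):
    global last_line, done
    if done:
        return None
    if sys.getsizeof(target) != baseline:
        # The size grew somewhere between the previously recorded line and here.
        done = True
        print('size changed somewhere after', last_line)
        sys.settrace(None)
        return None
    last_line = (frame.f_code.co_filename, frame.f_lineno)
    return tracer

sys.settrace(tracer)
import pandas
sys.settrace(None)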

abarnert

Python 3.3+'s str is quite a complicated structure, and can end up storing the underlying data in up to three different ways, depending on which APIs have been used with the string and on the code points it contains. The most common alternate representation is a cached UTF-8 encoding, but that only applies to non-ASCII strings, so it doesn't apply here.
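
Just to illustrate what such an alternate representation looks like from the outside (not the case at hand), here's a sketch that forces the UTF-8 cache into existence through the C API via ctypes and watches sys.getsizeof grow by len(utf8) + 1:

import ctypes
import sys

t = 'Ȣ'                    # non-ASCII, so any UTF-8 cache is separate memory
print(sys.getsizeof(t))    # 76 on a 64-bit build

ctypes.pythonapi.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUTF8.restype = ctypes.c_char_p
ctypes.pythonapi.PyUnicode_AsUTF8(t)   # fills the utf8 cache on the string object

print(sys.getsizeof(t))    # 79: two UTF-8 bytes plus a NUL terminator were cached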

In this case, I suspect the single-character string (which, as an implementation detail, is a singleton) was used in a way that triggered creation of the legacy wchar_t * representation (extensions using the legacy Py_UNICODE APIs can cause this), and your Python build uses a four-byte wchar_t, making the string eight bytes bigger than it otherwise would be (four for the 'a' itself, four more for the NUL terminator). Because it's a singleton, you don't even have to have triggered such a legacy API call yourself: any extension that retrieved a reference to the singleton and used it with the legacy API changes the observed size for everyone.

Personally, I can't reproduce this at all on my Linux 3.6.5 install (the sizes increase smoothly), indicating no wchar_t representation was created, and on my Windows 3.6.3 install, 'a' is only 54 bytes, not 58 (which matches Windows' native two-byte wchar_t). In both cases I'm running with IPython; it's possible that different versions of IPython's dependencies are responsible for your (and my) inconsistent observations.
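
If you want to predict what your own build should report, the legacy representation adds (length + 1) * sizeof(wchar_t) bytes on top of the normal size. A rough sketch (the 50-byte baseline assumes a 64-bit build where nothing has triggered the legacy API for 'a' yet):

import ctypes
import sys

s = 'a'
base = sys.getsizeof(s)                               # 50 here, if still untouched
extra = (len(s) + 1) * ctypes.sizeof(ctypes.c_wchar)  # 8 with 4-byte wchar_t, 4 on Windows
print(base, base + extra)                             # e.g. 50 58, or 50 54 on Windows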

To be clear, this extra cost is fairly immaterial; since the single-character string is a singleton, the incremental cost is really just 4-8 bytes total (depending on wchar_t width). You're not going to break the bank on memory if a handful of strings end up getting used with the legacy APIs.

ShadowRanger
  • The most common case is not the cached UTF-8 representation, it’s a 1-, 2-, or 4-byte fixed width string, which may or may not also have a cached UTF-8 rep (which can be the same as the fixed width only if that’s 1-byte). – abarnert Jul 16 '18 at 22:12
  • @abarnert: I phrased it poorly; I meant the most common *alternate* representation (I've edited it to fix that statement). And it's not the same as the fixed width if the fixed width is 1 byte and latin-1, only if it's 1 byte and ASCII. – ShadowRanger Jul 16 '18 at 22:14
  • I have a `startup.py` file that is executed at launch. When I comment that out, this inconsistency goes away. The only funky thing it does is temporarily redirect `sys.stdout` to `os.devnull`. But I have no clue whether that is the cause. Suffice it to say your answer is helpful regardless. – Brad Solomon Jul 16 '18 at 22:14
  • @BradSolomon: Does it use *any* non-builtin APIs? Granted, my reproducing setup only does some tweaking with `prompt_toolkit` (an `ipython` dependency which I'd assume wouldn't change anything). I'm content to just call it a mystery of non-default setups. – ShadowRanger Jul 16 '18 at 22:15
  • Yes, pandas & numpy. It's [here](https://github.com/bsolomon1124/config/blob/master/ipython/startup.py) in its entirety. But please don't trouble yourself, as you said this is a minor thing and seems specific to my setup. – Brad Solomon Jul 16 '18 at 22:18
  • @ShadowRanger That’s why I said “can… if…” as opposed to “will” or “iff” (because I didn’t want to go into all the details in a comment). I just wanted to point out that it was a bit misleading as phrased, assuming you’d be able to fix it better than me. :) – abarnert Jul 16 '18 at 22:20
  • @miradulo Not surprising; that function does all kinds of stuff, and does it in C code originally written for early 2.x Python, and code that probably hasn’t been a priority to update (since you obviously never use it in an inner loop or anything), so… – abarnert Jul 16 '18 at 22:24
  • @abarnert: miradulo deleted their comment for some reason, which API was the cause? – ShadowRanger Jul 16 '18 at 22:27
  • @ShadowRanger `set_printoptions` – abarnert Jul 16 '18 at 22:29
  • @ShadowRanger I thought I'd noticed before too many people had seen that my assertion was wrong - I too had a Pandas import statement snuck in between sessions. Apologies for the confusion. – miradulo Jul 17 '18 at 00:52