When you import pandas, it does a whole ton of NumPy stuff, including calling UNICODE_setitem on all the single-ASCII-letter strings, and presumably somewhere else doing something similar on the single-ASCII-digit strings. That NumPy function calls the deprecated C API PyUnicode_AsUnicode.

When you call that in CPython 3.3+, it caches the wchar_t * representation on the string's internal struct, in its wstr member, as the two wchar_t values w'a' and w'\0', which takes 8 bytes on a 32-bit-wchar_t build of Python. And str.__sizeof__ takes that into account.

So, all of the single-character interned strings for ASCII letters and digits—but nothing else—end up 8 bytes larger.
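If you want to sanity-check that arithmetic, here's a quick sketch (mine, not CPython's code; the 48-byte header constant assumes a 64-bit build of CPython 3.3-3.6):

import ctypes

SIZEOF_WCHAR_T = ctypes.sizeof(ctypes.c_wchar)   # 4 on Linux/macOS, 2 on Windows

def expected_sizeof(length, wstr_cached):
    # Loose paraphrase of str.__sizeof__ for the compact-ASCII case only:
    # 48-byte PyASCIIObject header, 1 byte per character plus a NUL, and,
    # if cached, the wstr buffer plus its own NUL terminator.
    size = 48 + length + 1
    if wstr_cached:
        size += (length + 1) * SIZEOF_WCHAR_T
    return size

print(expected_sizeof(1, False), expected_sizeof(1, True))   # 50 58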
First, we know that it's apparently something that happens on import pandas (per Brad Solomon's answer). It may happen on np.set_printoptions(precision=4, threshold=625, edgeitems=10) (miradulo posted, but then deleted, a comment to that effect on ShadowRanger's answer), but it definitely doesn't happen on plain import numpy.
Second, we know that it happens to 'a', but what about other single-character strings? To verify the former, and to test the latter, I ran this code:
import sys

# Boundary cases around the ASCII digits and letters (the characters just
# before and after '0'-'9', 'A'-'Z', and 'a'-'z'), plus some controls:
# NUL, newline, space, punctuation, a latin-1 char, a BMP char, an astral char.
strings = [chr(i) for i in (0, 10, 17, 32, 34, 47, 48, 57, 58, 64, 65, 90, 91, 96, 97, 102, 103, 122, 123, 130, 0x0222, 0x12345)]

sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import numpy as np
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

np.set_printoptions(precision=4, threshold=625, edgeitems=10)
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import pandas
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)
On multiple CPython installations (but all 64-bit CPython 3.4 or later on Linux or macOS), I got the same results:
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 58, '9': 58, ':': 50, '@': 50, 'A': 58, 'Z': 58, '[': 50, '`': 50, 'a': 58, 'f': 58, 'g': 58, 'z': 58, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
So, import numpy changes nothing, and neither does set_printoptions (presumably why miradulo deleted the comment…), but import pandas does. And it apparently affects ASCII digits and letters, but nothing else.
Also, if you change all of the prints to print(sizes.values()), so the strings never get encoded for output, you get the same results, which implies that either it's not about caching the UTF-8, or it is but it's always happening even if we don't force it.
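As a quick check on that UTF-8 theory, you can force the UTF-8 cache by hand with ctypes (same trick as below; this test is my own addition). PyUnicode_AsUTF8 caches the encoded bytes on a non-ASCII string, and str.__sizeof__ counts that cache (for compact-ASCII strings, the UTF-8 buffer just aliases the data, so only non-ASCII strings grow):

import ctypes
import sys

ctypes.pythonapi.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUTF8.restype = ctypes.c_char_p

s = '\x82'
print(sys.getsizeof(s))                      # 74
print(ctypes.pythonapi.PyUnicode_AsUTF8(s))  # b'\xc2\x82', now cached on s
print(sys.getsizeof(s))                      # 77: the 2 UTF-8 bytes plus a NUL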
The obvious possibility is that whatever Pandas is calling is using one of the legacy PyUnicode APIs to generate single-character strings for all of the ASCII digits and letters. So these strings end up not in compact-ASCII format, but in legacy-ready format, right? (For details on what that means, see the comments in the source.)
Nope. Using the code from my superhackyinternals, we can see that 'a' is still in compact-ASCII format:
import ctypes
import sys

from internals import PyUnicodeObject

s = 'a'
print(sys.getsizeof(s))

# Overlay the ctypes mirror of the C struct onto the live object.
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)

# For a compact-ASCII string, the character data starts right where the
# utf8_length field would otherwise be, so dump those two bytes (b'a\x00').
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

import pandas

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))
We can see that Pandas changes the size from 50 to 58, but the fields are still:
<__main__.PyUnicodeObject object at 0x101bbae18> 1 1 1 1 1 1
… in other words, it's 1BYTE_KIND, length 1, mortal-interned, ASCII, compact, and ready.
But if you look at ps.wstr, before Pandas it's a null pointer, while after Pandas it's a pointer to the wchar_t string w"a\0". And str.__sizeof__ takes that wstr size into account.
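If you don't want to pull in superhackyinternals, a minimal standalone version of just that wstr check looks like this (my own reduction of the 64-bit CPython 3.6 layout, with only the fields needed to reach wstr):

import ctypes

class PyASCIIObject(ctypes.Structure):
    # Just enough of CPython 3.6's PyASCIIObject (64-bit layout) to get to wstr.
    _fields_ = [
        ('ob_refcnt', ctypes.c_ssize_t),
        ('ob_type', ctypes.c_void_p),
        ('length', ctypes.c_ssize_t),
        ('hash', ctypes.c_ssize_t),
        ('state', ctypes.c_uint),      # interned/kind/compact/ascii/ready bits
        ('wstr', ctypes.c_wchar_p),
    ]

s = 'a'
print(PyASCIIObject.from_address(id(s)).wstr)   # None: wstr is a null pointer

import pandas
print(PyASCIIObject.from_address(id(s)).wstr)   # 'a': wstr now points at w"a\0"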
So, the question is: how do you end up with a compact-ASCII string that has a wstr value?
Simple: you call PyUnicode_AsUnicode on it (or one of the other deprecated functions or macros that access the 3.2-style native wchar_t * internal storage). That native internal storage doesn't actually exist in 3.3+, so, for backward compatibility, those calls are handled by creating the storage on the fly, sticking it on the wstr member, and calling the appropriate PyUnicode_AsUCS[24] function to decode into that storage. (Unless you're dealing with a compact string whose kind happens to match the wchar_t width, in which case wstr is just a pointer to the native storage after all.)
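You can see that last case in action (my own check, assuming a 4-byte-wchar_t build like the Linux and macOS ones above): a 2BYTE_KIND string needs a freshly allocated wstr buffer, but a 4BYTE_KIND string's data already is a valid NUL-terminated wchar_t string, so, if the parenthetical above holds, its size shouldn't change:

import ctypes
import sys

ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p

for s in ('\u0222', '\U00012345'):   # 2BYTE_KIND, then 4BYTE_KIND
    before = sys.getsizeof(s)
    ctypes.pythonapi.PyUnicode_AsUnicode(s)
    print(before, '->', sys.getsizeof(s))
# Expected on such a build: 76 -> 84 for 'Ȣ' (a new two-wchar_t buffer),
# but 80 -> 80 for the 4BYTE_KIND string (wstr just aliases the data).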
Ideally, you'd expect str.__sizeof__ to include that extra storage, and from the source, you can see that it does. Let's verify that:
import ctypes
import sys

s = 'a'
print(sys.getsizeof(s))

# Call the deprecated C API directly, forcing CPython to create and cache
# a wchar_t * copy of the string on its wstr member.
ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p
print(ctypes.pythonapi.PyUnicode_AsUnicode(s))

print(sys.getsizeof(s))
Tada, our 50 goes to 58.
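And since it's a cache, calling it a second time just hands back the same buffer without growing the string again:

print(ctypes.pythonapi.PyUnicode_AsUnicode(s))   # returns the cached wstr
print(sys.getsizeof(s))                          # still 58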
So, how do you work out where this gets called? There are actually a ton of calls to PyUnicode_AsUnicode, the PyUnicode_AS_UNICODE macro, and other functions that call them, throughout Pandas and NumPy. So I ran Python in lldb and set a breakpoint on PyUnicode_AsUnicode, with a script that skips the stop if the calling stack frame is the same as last time.
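The script was something along these lines (a reconstruction rather than the exact script, so treat it as a sketch; lldb resumes the target whenever a -F callback returns False):

# dedupe.py -- load into lldb with: command script import dedupe.py
# then:  breakpoint set -n PyUnicode_AsUnicode
#        breakpoint command add -F dedupe.stop_if_new_caller 1
_last_caller = [None]

def stop_if_new_caller(frame, bp_loc, internal_dict):
    caller = frame.get_parent_frame().GetFunctionName()
    if caller == _last_caller[0]:
        return False     # same caller as the last hit: keep running
    _last_caller[0] = caller
    return True          # new caller: stop so we can inspect the backtrace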
The first few calls involve datetime formats. Then there's one with a single letter. And the stack frame is:
multiarray.cpython-36m-darwin.so`UNICODE_setitem + 296
… and above multiarray, it's pure Python all the way up to the import pandas. So, if you want to know exactly where Pandas is calling this function, you'd need to debug in pdb, which I haven't done yet. But I think we've got enough info now.
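As a final sanity check on the UNICODE_setitem theory (my own extrapolation, not something from the debugging session above), storing an interned single-letter string into a NumPy unicode array should trigger the same growth, since that goes through UNICODE_setitem and, on NumPy builds from this era, through PyUnicode_AsUnicode:

import sys
import numpy as np

s = 'g'
print(sys.getsizeof(s))    # 50 before anything caches wstr

# Filling in each element of a unicode array goes through UNICODE_setitem.
np.array([s], dtype='U1')

print(sys.getsizeof(s))    # 58 if that path cached wstr on the interned 'g'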