Two reasons:
1. Every string in CPython has a fairly large amount of boilerplate in its C object header; on a 64-bit Python 2 system, the empty unicode object uses 52 bytes, and that's the fixed overhead of every unicode object before you even count the data it contains. If you have 114M unicode objects (that aren't singletons like u''), then you're using nearly 6 GB on the per-object overhead alone.
2. You're on Python 2 and decoding from str to unicode, which, depending on your Python 2 build configuration, uses a fixed 2 or 4 bytes per character even for purely ASCII strings; based on your numbers, you're on a 4 bytes/char (UCS-4) build. So instead of the data taking 543 MB beyond the object header overhead, it takes a titch over 2 GB (see the sketch just below).
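Here's a minimal sketch of both costs, assuming a 64-bit Py2 build with 4 bytes/char unicode; the numbers in the comments are from my machine and will vary by build:

import sys

# Fixed per-object overhead: the empty unicode object already costs 52
# bytes on this build, before any character data; a plain byte str is 37.
print(sys.getsizeof(u''))   # 52
print(sys.getsizeof(''))    # 37

# Per-character cost: each character adds 4 bytes on a wide (UCS-4)
# unicode build, versus 1 byte per character for a plain str.
print(sys.getsizeof(u'a'))  # 56
print(sys.getsizeof('a'))   # 38

# Back-of-the-envelope: headers alone for ~114M unicode objects.
print(114 * 10**6 * 52 / float(2**30))  # ~5.5 GiB before any character data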
The header issue is largely insurmountable (Python objects will always have a few dozen bytes wasted on the header); every Python object has a high fixed overhead (as noted, sys.getsizeof(u'') on my x64 system is 52, despite storing only eight bytes of "real" data, the string's length).
But since your input appears to be mostly ASCII, you could reduce your memory usage by moving to Python 3. Modern Py3 (3.3+, IIRC) uses dynamically sized storage for str (PEP 393's flexible string representation): a str that contains only ASCII/latin-1 characters uses one byte per character (latin-1 makes the fixed overhead a little higher than ASCII, but the cost per character remains 1), not two or four; anything in the Basic Multilingual Plane uses two bytes per character, not four; only strings containing non-BMP characters need four bytes per character. The header for Py3's str is also a little smaller (sys.getsizeof('') == 49, not 52), so you'd expect to reduce memory consumption by about 350 MB for the headers, and roughly 1.5 GB from the more compact data storage (since the data is mostly ASCII).
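You can see the tiered storage for yourself on any Py3.3+ interpreter; a quick illustrative check (the absolute sizes vary a little by version and platform, but the 1/2/4 bytes-per-character steps should be visible):

import sys

# Py3.3+ picks the narrowest storage that fits the widest character in
# the string: 1 byte/char for ASCII/latin-1, 2 for BMP, 4 for non-BMP.
ascii_s  = 'abcd'             # pure ASCII
latin1_s = 'abc\xe9'          # latin-1 (slightly bigger fixed overhead)
bmp_s    = 'abc\u20ac'        # euro sign is in the BMP: 2 bytes/char
astral_s = 'abc\U0001F600'    # emoji is outside the BMP: 4 bytes/char

for s in (ascii_s, latin1_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))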
Just use Py 3 and change the code to:
with open('tokens.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().split()

import sys
print(sys.getsizeof(tokens))
print(sum(sys.getsizeof(t) for t in tokens))
and you should see memory use for the strings reduce, significantly in the case of longer strings (e.g. on my Linux x64 install, u'examplestring' is 104 bytes on Py2 compiled with 4 bytes/char unicode, and only 62 bytes on Py3).
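If you want to reproduce that comparison, the same one-liner works on either version (the sizes in the comment are my measurements, quoted above):

import sys
print(sys.getsizeof(u'examplestring'))  # 104 on my 4 bytes/char Py2 build, 62 on Py3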
Alternatively, as a cheap hack, you could keep pure-ASCII tokens as plain str on Py2; the two types are largely interoperable there, and str has a smaller per-object overhead (37 bytes vs. 52) and only uses one byte per character. Converting from unicode back to ASCII after the fact is feasible but slow; it's cheaper to defer the decode entirely and only perform it for tokens that actually contain non-ASCII bytes. To do that, change your code to:
# Open in binary mode
with open('tokens.txt', 'rb') as f:
    # Defer the decode, and only do it for tokens containing non-ASCII
    # bytes, producing a list of mostly plain ASCII str with a few
    # unicode objects where non-ASCII appears
    tokens = [w.decode('utf-8') if max(w) > '\x7f' else w
              for w in f.read().split()]

import sys
print sys.getsizeof(tokens)
print sum(sys.getsizeof(t) for t in tokens)
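As a quick sanity check (continuing from the snippet above, which defines tokens), you can count how many tokens actually had to become unicode objects; if that number is tiny, the hack is paying off:

# Count the tokens that really needed the more expensive unicode type
n_unicode = sum(1 for t in tokens if isinstance(t, unicode))
print '%d of %d tokens needed unicode' % (n_unicode, len(tokens))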
That should save you ~1.7 GB on per-object headers, and the same ~1.5 GB on data storage, in exchange for potentially exposing you to the str/unicode interoperability quirks that Py2 has (and which were a large part of the motivation for separating bytes and str in Py3).
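To make those quirks concrete, here's a small Py2-only illustration of what "largely interoperable" buys and where it bites: mixing the two types works silently for pure ASCII, but misbehaves as soon as non-ASCII bytes show up:

# Pure ASCII: implicit coercion makes mixed comparisons work...
print 'abc' == u'abc'        # True

# ...but a non-ASCII byte str compared against unicode emits a
# UnicodeWarning and the comparison is simply False:
print '\xc3\xa9' == u'\xe9'  # False (UTF-8 bytes vs. the code point)

# ...and implicit coercion on concatenation fails outright:
try:
    '\xc3\xa9' + u'x'
except UnicodeDecodeError as e:
    print 'implicit ASCII decode failed:', e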