
I have a 543 MB txt file containing a single line of space separated, utf-8 tokens:

aaa algeria americansamoa appliedethics accessiblecomputing ada anarchism ...

But when I load this text data into a Python list, it uses ~8 GB of memory (~900 MB for the list and ~7 GB for the tokens):

with open('tokens.txt', 'r') as f:
    tokens = f.read().decode('utf-8').split()

import sys

print sys.getsizeof(tokens)
# 917450944 bytes for the list
print sum(sys.getsizeof(t) for t in tokens)
# 7067732908 bytes for the actual tokens

I expected the memory usage to be approximately file size + list overhead = 1.5 GB. Why do the tokens consume so much more memory when loaded into a list?

jkarimi

1 Answer


Two reasons:

  1. Every string in CPython carries a fairly large amount of boilerplate in its C object header; on a 64-bit Python 2 system, the empty unicode object uses 52 bytes, and that's the fixed overhead of every unicode object before you even count the data it contains. With the ~115 million unicode objects implied by your list size (917450944 bytes / 8 bytes per pointer), and none of them singletons like u'', you're using nearly 6 GB on the per-object overhead alone.

  2. You're on Python 2 and decoding from str to unicode, which, depending on your Python 2 build configuration, uses a fixed 2 or 4 bytes per character, even for purely ASCII strings; based on your numbers, you're on a 4 bytes/char (wide) build. So instead of the data taking 543 MB beyond the object header overhead, it takes a titch over 2 GB (see the quick sketch below).
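
As a quick sanity check, here is a back-of-the-envelope sketch using only the numbers from the question; it assumes the wide-build x64 Python 2 layout described above (8-byte list pointers, a 52-byte unicode header, 4 bytes per character):

# Rough estimate from the sizes reported in the question,
# assuming 8-byte pointers, 52-byte unicode headers, 4 bytes/char
list_bytes = 917450944      # sys.getsizeof(tokens)
file_mb = 543               # size of the mostly-ASCII input file

n_tokens = list_bytes // 8           # the list holds one 8-byte pointer per token
header_gb = n_tokens * 52 / 1e9      # fixed per-object overhead
data_gb = file_mb * 1e6 * 4 / 1e9    # ~4 bytes per character instead of 1

print("tokens:             ~%.0f million" % (n_tokens / 1e6))   # ~115 million
print("per-object headers: ~%.1f GB" % header_gb)               # ~6.0 GB
print("character storage:  ~%.1f GB" % data_gb)                 # ~2.2 GB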

The header issue is largely insurmountable: every Python object has a high fixed overhead, so a few dozen bytes per string will always be spent on the header (as noted, sys.getsizeof(u'') on my x64 system is 52, despite it storing only eight bytes of "real" data, its length).

But since your input appears to be mostly ASCII, you could reduce your memory usage by moving to Python 3. In modern Py3 (3.3+, IIRC), str uses dynamically sized storage: a str containing only ASCII/latin-1 characters uses one byte per character (latin-1 makes the fixed overhead a little higher than ASCII, but the cost per character remains 1), not two or four; anything in the Basic Multilingual Plane uses two bytes per character, and only strings containing non-BMP characters need four bytes per character. The header for str is also a little smaller (sys.getsizeof('') == 49, not 52), so you'd expect to reduce memory consumption by about 350 MB for the headers, and about 1.5 GB for the more compact data storage (since it's mostly ASCII).
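
If you want to see the flexible storage in action, a quick check like the following (exact byte counts vary a little by CPython version and platform, so treat the printed numbers as approximate) shows the per-character cost growing only when wider characters are actually present:

import sys

# Python 3.3+ picks the narrowest storage that can hold the string
for s in ['', 'examplestring',   # ASCII: 1 byte/char
          'caf\xe9',             # latin-1: still 1 byte/char, slightly bigger header
          'caf\u20ac',           # BMP char (euro sign): 2 bytes/char
          'caf\U0001f40d']:      # non-BMP char: 4 bytes/char
    print('%r -> %d bytes' % (s, sys.getsizeof(s)))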

Just use Py 3 and change the code to:

with open('tokens.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().split()

import sys

print(sys.getsizeof(tokens))
print(sum(sys.getsizeof(t) for t in tokens))

and you should see the memory used by the strings shrink significantly, especially for longer strings (e.g. on my Linux x64 install, u'examplestring' is 104 bytes on Py2 compiled with 4 bytes/char unicode, but only 62 bytes on Py3).

Alternatively, as a cheap hack on Py2, you could keep pure-ASCII tokens as str and only decode the ones that actually contain non-ASCII bytes; on Py2 the two types are largely interoperable, and str has a smaller per-object overhead (37 bytes vs. 52) and only uses one byte per character. Checking each token manually is feasible, though it will slow loading down. To do this, change your code to:

# Open in binary mode so we get byte strings (str) back
with open('tokens.txt', 'rb') as f:
    # Defer decoding: only decode words that actually contain non-ASCII
    # bytes, producing a list of mostly ASCII str with the occasional
    # unicode object when non-ASCII appears
    tokens = [w.decode('utf-8') if max(w) > '\x7f' else w
              for w in f.read().split()]

import sys

print sys.getsizeof(tokens)
print sum(sys.getsizeof(t) for t in tokens)

That should save you ~1.7 GB on per-object headers and the same ~1.5 GB on data storage, in exchange for potentially exposing you to the str/unicode interoperability quirks that Py2 has (which were a large part of the motivation for separating bytes and str in Py3).
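
To see where those savings come from on a per-token basis, a small Py2-only check like this (the rough numbers assume the same wide-build x64 Python 2 discussed above) compares a byte-string token with its decoded twin:

import sys

token = 'examplestring'          # pure-ASCII byte string (str)
utoken = token.decode('utf-8')   # the same token as a unicode object

# On a wide-build x64 Py2, expect roughly 50 vs. 104 bytes:
# 37 + 1 byte/char for str, 52 + 4 bytes/char for unicode
print sys.getsizeof(token)
print sys.getsizeof(utoken)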

ShadowRanger
  • If you're curious, the 52 bytes used by the x64 Py2 `unicode` are: 16 bytes for generic non-GC-ed object header (composed of an 8 byte refcnt, and an 8 byte pointer to the class object), 8 more for the length, 8 more for the pointer to the data, 8 more for the cached hash storage, and 8 more for a (initially NULL) pointer to an encoded (`str`) version of the data for use with the buffer protocol hacks Py2 allows. Add on the four bytes for the array of characters (which for the empty `unicode`, is just the NUL terminator), and that's 52 bytes before storing any real data. – ShadowRanger Jul 27 '17 at 04:12