54

I'm trying to load a couple of files into memory. The files have one of the following three formats:

  • string TAB int
  • string TAB float
  • int TAB float

Indeed, they are ngram statistics files, in case this helps with the solution. For instance:

i_love TAB 10
love_you TAB 12

Currently, what I'm doing is roughly this:

def loadData(file):
    data = {}
    for line in file:
        first, second = line.rstrip('\n').split('\t')
        data[first] = int(second)  # or float(second), depending on the file

    return data

Much to my surprise, while the total size of the files on disk is about 21 MB, the process takes 120-180 MB of memory when they are loaded! (The whole Python application doesn't load any other data into memory.)

There are fewer than 10 files; most of them stay stable at about 50-80k lines, except for one file which currently has millions of lines.

So I would like to ask for a technique/data structure to reduce the memory consumption:

  • Any advice on compression techniques?
  • If I still use dict, is there any way to reduce the memory? Is it possible to set the "load factor" as in Java for Python dict?
  • If you have some other data structures to suggest, I'm also willing to trade some speed to reduce the memory. Nevertheless, this is a time-sensitive application, so once users input their queries it wouldn't be reasonable to take more than a few seconds to return the result. With regard to this, I'm still amazed by how Google manages to make Google Translate so fast: they must be using a lot of techniques plus a lot of server power?

Thank you very much. I look forward to your advice.

Paul Hoang
  • Does any of the data (keys or values) repeat? If so, you can use references to the same objects to save space. – Gabe Apr 22 '12 at 03:10
  • @Gabe: I think the keys don't repeat (or if they do, I'll just let the latest override the previous one). Thanks. – Paul Hoang Apr 22 '12 at 03:12
  • How much of that 120-180MB of memory is the interpreter itself? – Platinum Azure Apr 22 '12 at 03:17
  • @PlatinumAzure Any suggestions to how I can find that out? Thanks. – Paul Hoang Apr 22 '12 at 03:30
  • @Gabe: The dictionary takes care of repeating keys already, as the keys are hashed. – Joel Cornett Apr 22 '12 at 03:31
  • @hunghuuhoang: What method are you using to determine the size of the Python dict? It's not something that is easy to determine. – Joel Cornett Apr 22 '12 at 03:32
  • Unfortunately Python is not designed to support this sort of thing natively. The way to do this is to write it as a C module (like NumPy). – Gabe Apr 22 '12 at 03:32
  • @JoelCornett: The OP's pseudocode implies that each file will be loaded into a separate dictionary, which would allow them to share keys. – Gabe Apr 22 '12 at 03:34
  • You can use Guppy / Heapy to find the memory breakdown of a Python module. (http://guppy-pe.sourceforge.net/) – Adam Lewis Apr 22 '12 at 03:34
  • @JoelCornett I think I created the wrong impression. The 120-180 MB is for the entire Python process. I haven't tried to measure the size of the Python dict. But since the whole app only loads the stats files, I guess it's reasonable to estimate the size of the dicts from that (minus the overhead of the Python interpreter). – Paul Hoang Apr 22 '12 at 04:07
  • @hunghuuhoang: I see. I'm curious as to what happens when you vary the size of the input file. Does the memory usage go up or down linearly? exponentially? – Joel Cornett Apr 22 '12 at 04:22
  • @PlatinumAzure I've never seen a Python interpreter go above 10 MB before, or a Jython go above 30 MB, so I doubt it's significant. – javawizard Nov 27 '12 at 03:33

6 Answers

86

I cannot offer a complete strategy that would help improve memory footprint, but I believe it may help to analyse what exactly is taking so much memory.

If you look at the Python implementation of dictionary (which is a relatively straight-forward implementation of a hash table), as well as the implementation of the built-in string and integer data types, for example here (specifically object.h, intobject.h, stringobject.h and dictobject.h, as well as the corresponding *.c files in ../Objects), you can calculate with some accuracy the expected space requirements:

  1. An integer is a fixed-sized object, i.e. it contains a reference count, a type pointer and the actual integer, in total typically at least 12 bytes on a 32bit system and 24 bytes on a 64bit system, not taking into account extra space possibly lost through alignment.

  2. A string object is variable-sized, which means it contains

  • reference count

  • type pointer

  • size information

  • space for the lazily calculated hash code

  • state information (e.g. used for interned strings)

  • a pointer to the dynamic content

    in total at least 24 bytes on 32bit or 60 bytes on 64bit, not including space for the string itself.

  3. The dictionary itself consists of a number of buckets, each containing
  • the hash code of the object currently stored (that is not predictable from the position of the bucket due to the collision resolution strategy used)

  • a pointer to the key object

  • a pointer to the value object

    in total at least 12 bytes on 32bit and 24 bytes on 64bit.

  4. The dictionary starts out with 8 empty buckets and is resized by doubling the number of entries whenever its capacity is reached.

I carried out a test with a list of 46,461 unique strings (337,670 bytes concatenated string size), each associated with an integer — similar to your setup, on a 32-bit machine. According to the calculation above, I would expect a minimum memory footprint of

  • 46,461 * (24+12) bytes = 1.6 MB for the string/integer combinations
  • 337,670 bytes = 0.3 MB for the string contents
  • 65,536 * 12 bytes = 0.75 MB for the hash buckets (after resizing 13 times)

in total 2.65 MB. (For a 64-bit system the corresponding calculation yields 5.5 MB.)

When running the Python interpreter idle, its footprint according to the ps tool is 4.6 MB. So the total expected memory consumption after creating the dictionary is approximately 4.6 + 2.65 = 7.25 MB. The true memory footprint (according to ps) in my test was 7.6 MB. I guess the extra ca. 0.35 MB was consumed by overhead generated through Python's memory allocation strategy (for memory arenas etc.).

Of course many people will now point out that my use of ps to measure the memory footprint is inaccurate and my assumptions about the size of pointer types and integers on 32-bit and 64-bit systems may be wrong on many specific systems. Granted.

But, nevertheless, the key conclusions, I believe, are these:

  • The Python dictionary implementation consumes a surprisingly small amount of memory
  • But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first
  • There is hardly a way to avoid the memory overhead, as long as you use Python and want the strings and integers represented as individual objects — at least I don't see how that could be done
  • It may be worthwhile to look for (or implement yourself) a Python-C extension that implements a hash that stores keys and values as C-pointers (rather than Python objects). I don't know if that exists; but I believe it could be done and could reduce the memory footprint by more than half.
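
If you want to cross-check figures like these on your own system, sys.getsizeof (available since Python 2.6) reports per-object sizes directly; a quick sketch:

import sys

# Numbers are illustrative only; they vary with Python version, build and platform.
print(sys.getsizeof(12345))       # one int object
print(sys.getsizeof('love_you'))  # one short string object
print(sys.getsizeof({}))          # an empty dict, including its pre-allocated table
# Note: getsizeof() of a dict does not count the key and value objects it refers to.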
jogojapan
  • Very helpful analysis indeed! And thank you for the concrete numbers. I have one question: in your "65,536 * 12 bytes" line, how do you get the number "65,536"? Thanks. – Paul Hoang Apr 22 '12 at 07:08
  • @Paul Hoang I should have explained that better: Starting with 8 entries, and resizing (doubling) 13 times, gives 8*(2^13)=65,536. Resizing only 12 times gives 32,768, so I assume for 46,461 entries it would resize at least a 13th time. – jogojapan Apr 22 '12 at 07:29
  • Integers are allocated from a pool that's allocated in blocks of 1000 bytes, so the average overhead of an integer is very little. – Gabe Apr 23 '12 at 11:51
  • @Gabe: But they're only interned up to 256; thereafter they're allocated just like any other object. – javawizard Nov 27 '12 at 03:37
  • @javawizard: Yes; I was talking about non-interned integers. – Gabe Nov 27 '12 at 13:44
  • For what it's worth, Pypy will optimize away lists of machine sized ints or strings so there's no memory overhead. – Antimony Aug 25 '13 at 14:21
  • Good analysis. I'm running 64 bit and found that sys.getsizeof('') == 37, and not 60 bytes. Where did you come up with the 60 bytes? – RussellStewart Mar 14 '14 at 22:23
  • @user2237635 Testing against `sys.getsizeof()` is a good idea -- I didn't do that. I might actually have used Python 2.5 when I wrote this, where `getsizeof` didn't exist. Anyway, my numbers are based on reading the source code and making assumptions about the size of basic types like `int`, `short` etc., which of course depends on the platform. Also, the implementation of the Python types has changed over time. Testing with `getsizeof` on my current 64-bit system shows a difference of 5 bytes for the size of `""` and `b""`, respectively, between Python 2.7.5 and Python 3.3. – jogojapan Mar 15 '14 at 07:33
  • So, for current Python versions, and of course depending on platform and system, you will get different numbers. But 60 seems quite large indeed -- I might have miscalculated the size for some of the data members. – jogojapan Mar 15 '14 at 07:34
  • NUMPY - If you're reading in a set of integers or floats and storing them in an array or multi-dimensional array, you can use the power of Numpy to do so compactly. It allows for specialized datatypes implemented in C that store raw data with no overhead. – Kevin J. Rice Dec 01 '14 at 16:49
  • link "here" is dead – Brambor Aug 09 '20 at 07:25
8

1) SQLite in memory sounds like a great solution; it'll let you query your data more easily once it's loaded, which is a pleasure:

sqlite3.connect(':memory:')
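
A minimal sketch of how loading and querying could look with the in-memory database (the table, column and file names here are made up for illustration; use REAL instead of INTEGER for the float-valued files):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ngrams (ngram TEXT PRIMARY KEY, value INTEGER)')

# load one of the TAB-separated files from the question
with open('ngrams.txt') as f:
    rows = (line.rstrip('\n').split('\t') for line in f)
    conn.executemany('INSERT INTO ngrams VALUES (?, ?)', rows)
conn.commit()

# look up a single ngram at query time
row = conn.execute('SELECT value FROM ngrams WHERE ngram = ?', ('i_love',)).fetchone()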

2) You probably want a named tuple - I'm pretty sure they're lighter than dictionaries and you can access properties using dot notation (for which I have an aesthetic preference anyway).

http://docs.python.org/dev/library/collections

3) You may want to have a look at Redis: https://github.com/andymccurdy/redis-py

It is FAST and will let you persist things easily, meaning you don't have to load the whole set in every time you want to use it.
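
A minimal sketch with redis-py, assuming a Redis server is already running locally (the key is just an ngram from the question):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
r.set('i_love', 10)           # stored in the Redis server, not in the Python process
count = int(r.get('i_love'))  # values come back as strings/bytes, so convert them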

4) A trie sounds like a good idea, but it adds some complexity on your end of the work. You can use Redis to implement and store it, though, which will boost your speed even further.

But overall, named tuples are probably the trick here.
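
For what it's worth, a tiny sketch of the named-tuple idea (the type and field names are invented here for illustration):

from collections import namedtuple

NgramStat = namedtuple('NgramStat', ['ngram', 'count'])

stat = NgramStat('i_love', 10)
print(stat.ngram, stat.count)  # dot-notation access, no per-instance __dict__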

Moshe Bildner
6

On disk you have just strings; when loading into Python, the interpreter has to create a whole structure for each string and for each dictionary, besides the string itself.

There's no way to reduce the memory used by the dicts, but there are other ways to approach the problem. If you're willing to trade some speed for memory, you should consider storing and querying the strings from an SQLite file instead of loading everything into dictionaries in memory.
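
A rough sketch of what the query-time lookups could look like, assuming the files have already been imported into an SQLite database file with the ngram column indexed (the file, table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect('ngrams.db')   # data stays on disk instead of in a Python dict

def lookup(ngram):
    # one indexed lookup per query; only the requested row is pulled into memory
    row = conn.execute('SELECT value FROM ngrams WHERE ngram = ?', (ngram,)).fetchone()
    return row[0] if row else None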

Pedro Werneck
  • Agreed. If it's too large to fit in memory, use a database. – Francis Avila Apr 22 '12 at 03:33
  • @Pedro Werneck Thank you for suggesting SQLite. I've tried Postgres and it's really much slower (than I wish). Do you know if using SQLite is faster than Postgres or MySQL? – Paul Hoang Apr 22 '12 at 04:09
  • This will only work if you can keep the whole DB in memory. If you tell SQLite to use memory instead of a file, you may be able to get the speed you need. It's not clear that you'll get better memory usage, though. – Gabe Apr 22 '12 at 04:18
  • SQLite is several times faster than PostgreSQL and MySQL – Pedro Werneck Apr 22 '12 at 14:14
4

Sounds like a Trie (http://en.m.wikipedia.org/wiki/Trie) data structure might better suit your desire for memory efficiency.

Update: the memory efficiency of the Python dict has been raised as an issue before, though the proposal was rejected from the standard library given the availability of third-party libraries. See: http://bugs.python.org/issue9520
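
A minimal nested-dict sketch of the idea (this pure-Python form mainly shows the shape of the structure; real memory savings usually require a C-backed trie library):

_end = object()   # sentinel key marking "a complete ngram ends here"

def trie_insert(root, key, value):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[_end] = value

def trie_lookup(root, key):
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(_end)

root = {}
trie_insert(root, 'i_love', 10)
print(trie_lookup(root, 'i_love'))   # 10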

Garrett
  • Good idea! It's faster to search, too. What about using a DAWG? I think it will cost less memory. – Ray Aug 15 '12 at 02:40
4

If you are trying to store numeric data compactly in memory in Python, your best solution is probably Numpy.

Numpy (http://numpy.org) allocates data structures using native C structures. Most of its data structures presume you're storing a single datatype, so this isn't for all situations (you might need to store null, etc.), but it can be very, very, very fast and is about as compact as you could ask for. Lots of science gets done with it (see also: SciPy).
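
For example, a small sketch of how compact the numeric side can be (the values are made up; numpy is a separate install):

import numpy as np

counts = np.array([10, 12, 7, 3], dtype=np.int32)   # 4 bytes per value, no per-object overhead
print(counts.nbytes)                                # 16 bytes of raw storage for four counts

probs = np.array([0.25, 0.5], dtype=np.float32)     # same idea for the float-valued files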

Of course, there is another option: zlib, if you have:

  • Ample CPU cycles, and
  • LOTS of data that won't fit into memory

You could just declare a 'page' of data (however big you want) as an array, read in the data, store it in the array, zip it, then read in some more data, until you have all the data in memory you want.

Then, iterate over the pages, unzip, convert back to an array, and do your operations as needed. For instance:

import array

# These are methods of a larger class in the original code (hence self and self.log).

def arrayToBlob(self, inArray):
    # pack a sequence of floats into a raw byte string
    a = array.array('f', inArray)
    return a.tostring()  # tobytes() on newer Pythons

def blobToArray(self, blob, suppressWarning=False):
    # unpack a raw byte string back into an array of floats
    out = array.array('f', [])
    try:
        out.fromstring(blob)  # frombytes() on newer Pythons
    except Exception as e:
        if not suppressWarning:
            msg = "Exception: blob2array, err: %s, in: %s" % (e, blob)
            self.log.warning(msg)
            raise Exception(msg)
    return out

Once you have data as a blob, you can pass this blob to zlib and compress the data. If you have lots of repeated values, this blob can be vastly compressed.
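
For example, a standalone sketch of the compress/decompress round trip (without the class plumbing above; tostring/fromstring are named tobytes/frombytes on newer Pythons):

import array
import zlib

blob = array.array('f', [1.0, 1.0, 2.5, 2.5]).tostring()   # pack floats into raw bytes
packed = zlib.compress(blob)                               # keep only this around per page

restored = array.array('f')
restored.fromstring(zlib.decompress(packed))               # unzip and rebuild the array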

Of course, it's slower than keeping it all uncompressed, but if you can't fit it in memory your choices are limited to begin with.

Even with compression, it might not all fit in memory, at which point you might have to write out the compressed pages as strings or pickles, etc.

Good luck!

Kevin J. Rice
3

You can replace dict with blist.sorteddict for log(n) access without the memory overhead. It is convenient because it behaves exactly like a dictionary, i.e. it implements all its methods, so you only have to change the initial type.
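
A quick sketch, assuming the third-party blist package is installed:

from blist import sorteddict

data = sorteddict()
data['i_love'] = 10       # used exactly like a regular dict
data['love_you'] = 12
print(data['i_love'])     # 10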

ealfonso