
My code builds a dict (with strings as keys and numpy arrays as values) that is too big to fit into RAM, so the program crashes ('Cannot allocate memory', 'killed', 'aborted').

Having read some SO articles, I got the impression that I need to use a database to handle this case. But which one should I use? The bsddb module (the interface to the Berkeley DB library) recommended at Python Disk-Based Dictionary only accepts strings as values, which makes it seem very cumbersome to use with numpy arrays. I also looked briefly at sqlite3, recommended at How to handle Out of memory with Python, but I would really like to avoid using SQL to access my data.

What would you recommend?

Framester
  • You could always just store the numpy arrays on disk and make the dictionary `key: file path`. – Katriel Feb 21 '12 at 10:47
  • @katrielalex Thanks, but the data actually comes from many little files, and I'm trying to have them all in one place to do computations on them. See http://stackoverflow.com/questions/9232944/how-to-save-big-not-huge-dictonaries-in-python – Framester Feb 21 '12 at 11:00
  • 1
    from reading your previous questions, it seems you're asking for contradictory things. You want to have all these 1000x1000 arrays in memory (to do computations on them) but you want to have them on disk (because you don't have enough memory). Pick one! – Katriel Feb 21 '12 at 11:04
  • 3
    You could have a look at PyTables (see e.g. http://stackoverflow.com/a/7891137/398968) – Katriel Feb 21 '12 at 11:41
  • 1
    It really sounds like you should be using PyTables. HDF5 is custom made for storing this sort of data, and PyTables is a very elegant, Pythonic way to access HDF5 without too much pain. – talonmies Feb 21 '12 at 22:17
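
Following up on the PyTables suggestion, here is a minimal sketch of storing one array per key in an HDF5 file. The file and node names are illustrative assumptions, and it uses the modern tables.open_file API (the thread-era spelling was openFile):

import numpy as np
import tables  # PyTables

# Write one array per key; the data lives on disk, not in RAM.
with tables.open_file("arrays.h5", mode="w") as h5:
    for key in ("a", "b"):
        h5.create_array(h5.root, key, np.zeros((1000, 1000)))

# Read a single array back by key, without touching the others.
with tables.open_file("arrays.h5", mode="r") as h5:
    a = h5.get_node(h5.root, "a").read()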

2 Answers


sqlite would seem perfect, given that your query pattern will be very simple.
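
For illustration, a minimal sketch of that route, serializing each array to .npy bytes and storing it as a BLOB keyed by string (the table and file names are assumptions, not from the original answer):

import io
import sqlite3
import numpy as np

conn = sqlite3.connect("arrays.db")
conn.execute("CREATE TABLE IF NOT EXISTS arrays (key TEXT PRIMARY KEY, data BLOB)")

def put(key, arr):
    # Serialize the array to the .npy format in memory and store the bytes.
    buf = io.BytesIO()
    np.save(buf, arr)
    conn.execute("INSERT OR REPLACE INTO arrays VALUES (?, ?)",
                 (key, sqlite3.Binary(buf.getvalue())))
    conn.commit()

def get(key):
    # Fetch the BLOB for this key and deserialize it back into an array.
    row = conn.execute("SELECT data FROM arrays WHERE key = ?", (key,)).fetchone()
    return np.load(io.BytesIO(bytes(row[0])))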

Another option, which I frequently mention, is redis (http://redis.io), a key-value server.

Memcached (http://memcached.org/) and MongoDB (http://www.mongodb.org/) are other popular NoSQL databases.

If none of these take your fancy, google NoSQL to see what other projects are out there.

Marcin
  • 2
    Redis and Memcached are in memory, so I don't think they are good options in this case. – ustun Feb 21 '12 at 11:05

Here's a simple solution which might work for you. Instead of keeping the arrays in the dict (and therefore in memory), write each one to a file and store only the file path. As long as you're careful with your references, each array will be freed by the reference counter once written, and loaded from disk again only when you access it.

EDIT: You could tweak this by using npz files to store a few keys at a time, especially if you don't need random access.
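
For instance, a minimal sketch of that npz variant (the grouping and file name are made up for illustration):

import numpy as np

# Save several keys into one .npz archive on disk.
np.savez("group0.npz", a=np.zeros((2, 2)), b=np.ones((2, 2)))

# np.load on an .npz is lazy: each member array is read from disk
# only when its key is accessed.
group = np.load("group0.npz")
a = group["a"]
group.close()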

Code

import tempfile
import numpy

class numpy_dict(dict):
    def __setitem__(self, key, value):
        # Dump the array to a named temp file (delete=False keeps the file
        # on disk after the handle closes) and store only its path.
        with tempfile.NamedTemporaryFile(delete=False) as f:
            numpy.save(f, value)
            super(numpy_dict, self).__setitem__(key, f.name)

    def __getitem__(self, key):
        # Look up the stored path and read the array back from disk.
        path = super(numpy_dict, self).__getitem__(key)
        return numpy.load(path)

Example usage

>>> import so
>>> import numpy as np
>>> x = so.numpy_dict()
>>> x["a"] = np.zeros((2,2))
>>> x["b"] = np.ones((2,2))
>>> x["a"]
array([[ 0.,  0.],
       [ 0.,  0.]])
>>> x["b"]
array([[ 1.,  1.],
       [ 1.,  1.]])
>>> dict.__getitem__(x, "a")
'/tmp/tmpxIxt0O'
>>> dict.__getitem__(x, "b")
'/tmp/tmpIviN4M'
>>> from sys import getrefcount as refs
>>> x = np.zeros((2,2))
>>> refs(x)
2
>>> x = so.numpy_dict()
>>> y = np.zeros((2,2))
>>> refs(y)
2
>>> x["c"] = y
>>> refs(y)
2
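
One caveat worth flagging (my note, not part of the original answer): the temp files are created with delete=False and never removed, so deleted or overwritten keys leak files on disk. A possible cleanup hook, as a sketch:

import os

class cleaned_numpy_dict(numpy_dict):
    def __delitem__(self, key):
        # Fetch the backing file's path, drop the key, then remove the file.
        path = dict.__getitem__(self, key)
        super(cleaned_numpy_dict, self).__delitem__(key)
        os.remove(path)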
Katriel