2

In one of my programs I use a sparse array of data, which is currently implemented as an integer-indexed dict like this:

{
   0: {some dict with data},
   1: {some similar but yet different dict},
   10: {...},
   100: {...},
   200: {...},
   etc
}

It turned out that this dict takes too much memory for my purposes. Is there a way to store sparse arrays more efficiently? I'm ready to sacrifice a few milliseconds of access time for the sake of lower memory consumption. The key range is 0..0xFFFFFF, and the sparseness is about 30%.

Although a third-party module might be an option, I'm more interested in a pure Python solution.

Thanks.

To clarify, inner dicts are not subject to optimisation; I'm only trying to arrange them in a better way. For simplicity, let's pretend I have strings rather than dicts there:

data = {
   0: "foo",
   1: "bar",
   10: "...",
   100: "...",
   200: "...",
   etc
}
gog
  • 10,367
  • 2
  • 24
  • 38
  • Not sure what you mean, you are using exactly as much memory as you have values, not sure how this "takes too much memory", isn't it just the amount of memory you need? Or maybe you are asking about compression? Maybe this is helpful: http://stackoverflow.com/questions/10264874/python-reducing-memory-usage-of-dictionary – bosnjak Apr 10 '14 at 11:17
  • @Lawrence: That's not necessarily true: dictionaries have some overhead. As a simple example, compare `sys.getsizeof([("a", 1), ("b", 2)])` to `sys.getsizeof({"a": 1, "b": 2})` – David Robinson Apr 10 '14 at 11:22
  • @DavidRobinson: yes, but the overhead is quite minimal, and it most likely cannot get smaller, not with pure Python. Maybe if there is a pattern in the data, something could be achieved by a different arrangement or a custom 'compression' algorithm... – bosnjak Apr 10 '14 at 11:26
  • 1
    What kind of data is in the inner dicts? The memory consumption of this structure is dominated by the inner dicts, so without knowing more about them, it's impossible to give a useful answer. – Sven Marnach Apr 10 '14 at 11:26
  • @Lawrence: that doesn't necessarily hold depending on how sparse the data is. (In the extreme example, if each inner dict held one item, switching each to a list of tuples could divide the memory usage by ~3.) – David Robinson Apr 10 '14 at 11:29
  • @Lawrence: Actually, contrary to my original intuition this holds even for large inner dictionaries. A dictionary takes up about 4-16 times as much memory as a list of tuples. Showing work: `def f(l): return float(sys.getsizeof(dict(l))) / sys.getsizeof(l)`, `[f([(i, "a") for i in range(n)]) for n in range(100, 10000, 100)]`. – David Robinson Apr 10 '14 at 11:35
  • I wasn't considering switching to something other than a dictionary; I didn't understand that to be the question, but rather the optimization of the dictionary itself... In that case, by all means. – bosnjak Apr 10 '14 at 11:37
  • I agree with @DavidRobinson: the first thing to try is a list of tuples. If that doesn't work, I don't think Python really offers a solution, unless you can somehow optimise your inner dicts. And I suspect those dominate, even if they are small. – Adrian Ratnapala Apr 10 '14 at 11:47
  • `To clarify, inner dicts are not subject to optimisation` This seems to be giving up quite quickly. – David Robinson Apr 10 '14 at 11:50
  • Why are the "inner dicts not subject to optimisation"? They do dominate your memory consumption, and if you are not willing to replace them by anything else, there's really not much you can do (except for not keeping all the data in memory in the first place). – Sven Marnach Apr 10 '14 at 12:45
  • @SvenMarnach: The point of my question is the following: my structure only has integer keys in a limited range, while generic dicts are optimised for all kinds of keys. Obviously, being generic causes a lot of overhead, therefore I believe I could implement my own "dict" more efficiently than the built-in one. Essentially, I'm looking for a good hash function that would allow me to store my values in a python list and, given a key, generate the list index from it. What kind of values I'm going to store is not important. – gog Apr 10 '14 at 13:07
  • 1
    I get that this is what you are hoping for. The point I'm trying to make is that you can't possibly gain anything this way. At least 95 % of your memory consumption is by your inner dicts. Even if you managed to squeeze the consumption of the outer dict by a factor of ten, your total memory consumption would only be reduced to 95.5 % at best, which would hardly help. – Sven Marnach Apr 10 '14 at 14:41

1 Answer

3

If the structure is a mapping, then a dict-like object really is the right option, and if memory is an issue then the obvious solution is to work off a file instead. The easiest approach may be to use a pandas Series, which can be used like a dict and can work directly through an HDF5 file (see http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables).
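
For illustration, a rough sketch of that idea (assuming pandas and PyTables are installed; the file name sparse.h5 and the sample values are made up):

import pandas as pd

# A Series built from the sparse data supports dict-style lookups.
data = pd.Series({0: "foo", 1: "bar", 10: "baz", 100: "qux"})
print(data[10])        # -> baz

# Persist it to an HDF5 file (this needs the PyTables package)...
data.to_hdf("sparse.h5", key="data")

# ...and read it back later instead of keeping everything in memory.
restored = pd.read_hdf("sparse.h5", key="data")
print(restored[100])   # -> qux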

Alternatively, for a pure Python solution, you could use the `shelve` module.
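
A rough sketch of that (the file name is arbitrary; note that `shelve` keys must be strings, so the integer keys have to be converted):

import shelve

# Write the values into a disk-backed, dict-like store; keys must be strings.
db = shelve.open("sparse_data")
db[str(0)] = "foo"
db[str(1)] = "bar"
db[str(100)] = "qux"
db.close()

# Reopen it later; only the entries you actually access are loaded into memory.
db = shelve.open("sparse_data")
print(db["100"])       # -> qux
db.close()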

aquavitae
  • 17,414
  • 11
  • 63
  • 106