
There are plenty of questions and discussions about the memory consumption of different Python data types, yet few of them (if any) address this very specific scenario: when you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?

At first I thought a dict is more powerful than a list of tuples, and that power must come at some price. An empty dict DOES occupy more memory than an empty list or tuple (see In-memory size of a Python structure), so I assumed that using [(key1, value1), (key2, value2), ...] would be more memory-efficient than {key1: value1, key2: value2, ...}.

It looks like I was wrong. Just run the following code snippet and check the memory consumption reported by your OS. I am using Windows XP, and its Task Manager tells me that a large dict eats up "only" 40MB of RAM and 40MB of virtual RAM, while a list of tuples eats up 60MB of RAM and 60MB of virtual RAM.

How could that be?

from sys import getsizeof as g
raw_input('ready, press ENTER')
i = 1000000
#p = [(x, x) for x in xrange(i)] # Will print 4,348,736 40,348,736
p = dict((x, x) for x in xrange(i)) # Will print 25,165,964 37,165,964
print g(p), g(p) + sum(g(x) for x in p)
raw_input("Check your process's memory consumption now, press ENTER to exit")

Update:

Thanks for some of the comments below. I want to clarify: I'm talking about memory-efficiency. And no, there is no need to worry about key-value lookup efficiency in this case; let's just assume my algorithm will consume them one by one via an iterator.

RayLuo
  • You are asking the wrong question. If you need key-value lookup, then go with dict. If you need an array, then use list or tuple. – Hai Vu Mar 26 '13 at 15:51
  • Python keeps a hash table for dictionaries. [This link](http://mail.python.org/pipermail/python-list/2000-March/048085.html) was from [another answer](http://stackoverflow.com/questions/114830/is-a-python-dictionary-an-example-of-a-hash-table). I think, then, that dictionaries are faster for lookups, and tuples use less memory. – mbowden Mar 26 '13 at 15:53
  • For some kinds of data you can use something even more optimal than both your options, like a trie. – wRAR Mar 26 '13 at 15:59
  • 1
    Efficient for what? For efficient memory use or for doing fast lookups? – Brian Neal Mar 26 '13 at 16:09

2 Answers


Your list of tuples adds an extra layer. You have 3 layers of items:

  • The outer list of length 1 million, so 1 million pointers
    • 1 million 2-slot tuples, so 2 million pointers
      • 2 million references to 1 million integer values

while your dict only holds:

  • The dict (including 1 million cached hashes) with 2 million pointers + extra space to grow the table
    • 2 million references to 1 million integer values

It's those 1 million tuples plus the list to hold the references to them that take up more memory than the 1 million cached hashes. There are some 50% more pointers involved here, easily accounting for the 50% more memory use you see.
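
To make the layer count concrete, here is a minimal sketch (Python 2 on CPython; the exact byte counts vary by platform and build) that sums only the container overhead of each structure, leaving the shared integer objects out:

from sys import getsizeof

n = 1000000
pairs = [(x, x) for x in xrange(n)]
d = dict((x, x) for x in xrange(n))

# The list layer plus one tuple per pair -- two container layers.
list_overhead = getsizeof(pairs) + sum(getsizeof(t) for t in pairs)
# The dict is a single container; getsizeof already covers its whole
# hash table, including the key/value pointers and cached hashes.
dict_overhead = getsizeof(d)

print "list-of-tuples container overhead:", list_overhead
print "dict container overhead:          ", dict_overhead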

There is another downside to your list of tuples: lookup time. To find a matching key in the dict, lookups have O(1) average complexity. To do the same in the list of tuples, you potentially have to scan the whole list, an O(n) cost. Don't use a list of tuples if you need to map keys to values.
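
A quick sketch of that difference using timeit (Python 2; absolute timings are machine-dependent, and key = n - 1 is picked as the worst case for the scan):

import timeit

setup = """
n = 100000
pairs = [(x, x) for x in xrange(n)]
d = dict(pairs)
key = n - 1  # worst case for the linear scan
"""

# Hash lookup: O(1) on average.
print timeit.timeit('d[key]', setup=setup, number=1000)
# Linear scan: O(n), walks the list until the key matches.
print timeit.timeit('next(v for k, v in pairs if k == key)', setup=setup, number=1000)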

Martijn Pieters
  • I think you are right about the extra layer thing. So do you think in this case a dict is still the most memory-efficient, even if I just want to "hold" those data? (Let's assume I don't need random lookup.) – RayLuo Mar 26 '13 at 17:24
  • @Iceberg: I'd not hold the data. If you are not looking things up in it, what is the point? You can also use a *flat* tuple, so no nesting of pairs; you can still recreate the pairs easily. – Martijn Pieters Mar 26 '13 at 17:28
  • What I want is to hold them and then iterate over them. The flat tuple trick might help, at the price of losing some readability. – RayLuo Mar 26 '13 at 18:22
  • @Iceberg: Pairwise iteration: [Iterating over every two elements in a list](http://stackoverflow.com/q/5389507) – Martijn Pieters Mar 26 '13 at 18:29
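
For reference, a minimal sketch of the flat-sequence idea discussed above (Python 2; the izip-over-one-iterator recipe is one of several from the linked question):

from itertools import izip

# One flat container instead of one tuple per pair.
flat = (1, 'a', 2, 'b', 3, 'c')

# Recreate the pairs lazily: both izip arguments share one iterator,
# so consecutive elements are grouped two at a time.
it = iter(flat)
for key, value in izip(it, it):
    print key, value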

You're actually getting an incomplete picture of the memory use here. The total size of a dictionary more than doubles at irregular intervals, and if you compare the sizes of these two structures right after the dictionary has grown, the dictionary is bigger again. A simple script with a recursive size function (see code below) shows a pretty clear pattern:

i:  2  list size:  296  dict size:  328  difference:  -32
i:  3  list size:  392  dict size:  352  difference:  40
i:  4  list size:  488  dict size:  376  difference:  112
i:  5  list size:  616  dict size:  400  difference:  216
i:  7  list size:  808  dict size:  1216  difference:  -408
i:  10  list size:  1160  dict size:  1288  difference:  -128
i:  13  list size:  1448  dict size:  1360  difference:  88
i:  17  list size:  1904  dict size:  1456  difference:  448
i:  23  list size:  2480  dict size:  3904  difference:  -1424
i:  31  list size:  3328  dict size:  4096  difference:  -768
i:  42  list size:  4472  dict size:  4360  difference:  112
i:  56  list size:  5912  dict size:  4696  difference:  1216
i:  74  list size:  7880  dict size:  5128  difference:  2752
i:  100  list size:  10520  dict size:  14968  difference:  -4448
i:  133  list size:  14024  dict size:  15760  difference:  -1736
i:  177  list size:  18672  dict size:  16816  difference:  1856
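
The sudden jumps in the dict column are hash-table resizes. Here's a minimal sketch that watches them directly (CPython 2.x; the exact thresholds are implementation details):

import sys

d = {}
last = sys.getsizeof(d)
for n in xrange(100000):
    d[n] = n
    size = sys.getsizeof(d)
    if size != last:
        print "at %d items: %d -> %d bytes" % (len(d), last, size)
        last = size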

This pattern continues as i grows. (You can test this using your method -- try setting i near 2636744. At that point, at least on my machine, the dictionary is the larger of the two.) Martijn is right that the tuples in the list of tuples add memory overhead, canceling out the memory advantage of a list over a dictionary. But the result, on average, is not that the dictionary is better; it's that the two are about the same. So in answer to your original question:

When you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?

It doesn't really matter if all you're concerned about is memory.

However, note that iterating over a dictionary is often a bit slower than iterating over a list, because there's no good way to avoid visiting all the empty bins in the dictionary's hash table. So there's a bit of a tradeoff -- dictionaries are (much) faster at random key lookups, while lists are (a bit) faster at iteration. The dictionary will probably be the better choice most of the time, but in some rare cases the list may provide a micro-optimization.
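
A rough sketch of that iteration gap (Python 2 timeit again; the ratio varies with size and build):

import timeit

setup = """
n = 1000000
pairs = [(x, x) for x in xrange(n)]
d = dict(pairs)
"""

# The list walks a dense array of pointers.
print timeit.timeit('for kv in pairs: pass', setup=setup, number=10)
# The dict also has to skip the empty slots in its hash table.
print timeit.timeit('for k in d: pass', setup=setup, number=10)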


Here's the code that tests for size. It probably won't generate correct results for all corner cases, but it should handle simple structures like this without any problems. (But let me know if you see any issues.)

import sys, collections, itertools

def totalsize(x):
    """Recursively sum sys.getsizeof over a structure, counting each object once."""
    seen = set()
    return ts_rec(x, seen)

def ts_rec(x, seen):
    # Shared objects (e.g. small interned ints) are only counted once.
    if id(x) in seen:
        return 0
    else:
        seen.add(id(x))

    x_size = sys.getsizeof(x)
    if isinstance(x, collections.Mapping):
        # Recurse into both the keys and the values.
        kv_chain = itertools.chain.from_iterable(x.iteritems())
        return x_size + sum(ts_rec(i, seen) for i in kv_chain)
    elif isinstance(x, collections.Sequence):
        return x_size + sum(ts_rec(i, seen) for i in x)
    else:
        return x_size

for i in (10 ** (e / 8.0) for e in range(3, 19)):
    i = int(i)
    lsize = totalsize([(x, x) for x in xrange(i)])
    dsize = totalsize(dict((x, x) for x in xrange(i)))

    print "i: ", i,
    print " list size: ", lsize, " dict size: ", dsize,
    print " difference: ", lsize - dsize
senderle
  • Appreciate your trying to help. At least you understand the original question much better than those who commented on it. So my understanding of your point is "**it depends ... on how many items to hold, and we'd better benchmark if the length is known and fixed**". BTW, I modified your script to `for i in (10 ** (e / 8.0) for e in range(3, 49)):`, and saw the dict always outperform the list of tuples when i=42169, 56234, 74989, ... all the way to i=1000000. Oh, thanks for mentioning the iteration speed too. – RayLuo Mar 27 '13 at 02:38
  • @Iceberg, yes, that's about what I mean. But I would add that unless you're doing some serious micro-optimization, the benchmark isn't really worth the trouble; use the structure that makes practical sense for your problem. And on the other hand, if you're doing micro-optimization and you don't care about random key access, then you'll probably get the best result from a flat list, as suggested by Martijn. – senderle Mar 27 '13 at 13:00
  • When people say "micro-optimization", they may be hinting that the effort isn't worthwhile. But in this memory-oriented case, the plus/minus difference can range from 2% to 71%. It **IS** significant! Besides, a dict is semantically similar to a list of tuples, but not to a flat list. All in all, now we know all the pros and cons, so we can choose either of them when it fits a given situation. Thanks to all the people who contributed to this thread! – RayLuo Mar 28 '13 at 15:37