27

Suppose there is a structure like this:

{'key1' : { 'key2' : { .... { 'keyn' : 'value' } ... } } }

Using Python, I'm trying to determine the advantages/disadvantages of two different approaches:

{'key1' : { 'key2' : { .... { 'keyn' : 'value' } ... } } } # A. nested dictionary
{('key1', 'key2', ...., 'keyn') : 'value'} # B. a dictionary with a tuple used as the key
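
To make the comparison concrete, here is a minimal sketch of how both layouts would be built and read, using a hypothetical three-key example:

# A. nested dictionary
a = {'key1': {'key2': {'key3': 'value'}}}
print(a['key1']['key2']['key3'])        # 'value'

# B. dictionary keyed by a tuple
b = {('key1', 'key2', 'key3'): 'value'}
print(b['key1', 'key2', 'key3'])        # 'value' (the subscript builds the tuple key)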

Then I'd like to know which is better (A or B) in terms of:

  • Memory occupation
  • Complexity of insertion (considering collision-avoidance algorithms, etc.)
  • Complexity of lookup
Sebastiano Merlino
  • Complexity to find _what_? A `key` or a `value`? – agf Aug 22 '11 at 13:05
  • In your answer, you don't consider that each value in the problem is identified by an ordered sequence of keys. With your approach you can access a value from any single key without using the others, and that is not what I want. In my example it should be possible to have two sequences that differ in only one key and reference two different elements. – Sebastiano Merlino Aug 22 '11 at 13:20
  • Yes, I updated my answer once I understood the problem. The `tuple` is almost certainly better. – agf Aug 22 '11 at 13:31

4 Answers

13

Without going into details (which are highly implementation-dependent anyway and may be invalidated by the next genius to come along and tweak the dictionary implementation):

  • For memory overhead: Each object has some overhead (e.g. refcount and type; an empty object is 8 bytes and an empty tuple is 28 bytes), but hash tables need to store hash, key and value, and usually use more buckets than currently needed to avoid collisions. Tuples, on the other hand, can't be resized and don't have collisions, i.e. an N-tuple can simply allocate N pointers to the contained objects and be done. This leads to noticeable differences in memory consumption.
  • For lookup and insertion complexity (the two are identical in this regard): Be it a string or a tuple, collisions are rather unlikely in CPython's dict implementation and are resolved very efficiently. More keys (because you flatten the key space by combining the keys into tuples) may seem to increase the likelihood of collisions, but more keys also lead to more buckets (AFAIK the current implementation tries to keep the load factor below 2/3), which in turn makes collisions less likely. Also, you don't need more hashing (well, one more function call and some C-level xor-ing for the tuple hash, but that's negligible) to get to a value.

You see, there shouldn't be any noticeable difference in performance, although there will be some difference in memory consumption. The latter won't be notable, I think. A one-element dict is 140 bytes, and a ten-element tuple is 140 bytes as well (according to sys.getsizeof on Python 3.2). So even with the (already unrealistic, says my gut feeling) ten-level nesting, you'd have slightly more than one kB of difference - possibly less if the nested dicts have multiple items (depends on the exact load factor). That might matter for a data-crunching application that has hundreds of such data structures in memory, but most objects aren't created that frequently.
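
If you want to reproduce such numbers on your own interpreter (the exact sizes vary between Python versions and builds), a quick check looks like this:

import sys

print(sys.getsizeof({'a': 1}))          # size of a one-element dict in bytes
print(sys.getsizeof(tuple(range(10))))  # size of a ten-element tuple in bytes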

You should simply ask yourself which model is more appropriate for your problem. Consider that using a tuple of keys requires you to have all keys for a value available at once, while using nested dictionaries allows getting there incrementally.
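
As a small illustration of that last point (hypothetical keys): with nesting you can hold on to an intermediate level and descend step by step, while the tuple key only works once the full key sequence is known.

nested = {'key1': {'key2': {'key3': 'value'}}}
level = nested['key1']                  # partial lookup: a sub-dict you can pass around
print(level['key2']['key3'])            # descend the rest of the way later

flat = {('key1', 'key2', 'key3'): 'value'}
print(flat['key1', 'key2', 'key3'])     # needs the complete key sequence at once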

Leland Hepworth
  • Insertion complexity may be the same for both, but the nested insertion will actually be much slower -- you may have to create multiple dictionaries, and you have to implicitly or explicitly check for their existence first. – agf Aug 22 '11 at 14:00
  • @agf: There probably will be some constant overhead, yes, but only profiling can show how much. Consider: (1) Insertion with tuples needs to construct a tuple, unless of course such a tuple is kept around. (2) There are imaginable applications that could benefit from re-using the result of a partial lookup (for example, making multiple insertions at the same level or below a certain level). Assuming of course there *are* (potentially) multiple values per nesting level... using dicts as a linked list is indeed wasteful. –  Aug 22 '11 at 14:05
2

Performance Testing

I performed tests for looping, retrieval, and insertion for a nested dictionary and a dictionary with tuple keys. They are one level deep, with 2,000,000 values. I also did retrieval and insertion for the tuple dict with the tuple already created.

These are the results. I don't think you can draw firm conclusions from the standard deviations.

-

keydictinsertion: Mean +- std dev: 615 ms +- 42 ms  
keydictretrieval: Mean +- std dev: 475 ms +- 77 ms  
keydictlooping: Mean +- std dev: 76.2 ms +- 7.4 ms  

nesteddictinsertion: Mean +- std dev: 200 ms +- 7 ms  
nesteddictretrieval: Mean +- std dev: 256 ms +- 32 ms  
nesteddictlooping: Mean +- std dev: 88.5 ms +- 14.4 ms  

Test where the tuple was already created for the keydict
keydictinsertionprepared: Mean +- std dev: 432 ms +- 26 ms  
keydictretrievalprepared: Mean +- std dev: 267 ms +- 14 ms

-

As you can see, the nested dict is often faster than the dict with tuple keys. Even when the keydict was given an already-built tuple, skipping the tuple creation step, insertion was still much slower. It seems that creating the inner dicts does not cost that much; defaultdict probably has a fast implementation. Retrieval was almost equal when no tuple had to be created, and the same goes for looping.
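
For reference, a rough sketch of the insertion patterns these benchmarks compare: defaultdict(dict) creates a missing inner dict on demand, a plain dict would need something like setdefault, and the tuple-key dict builds a tuple for every insert.

from collections import defaultdict

nested = defaultdict(dict)
nested[1][2] = 'value'                          # inner dict created automatically

plain_nested = {}
plain_nested.setdefault(1, {})[2] = 'value'     # explicit create-if-missing per insert

flat = {}
flat[1, 2] = 'value'                            # builds the (1, 2) tuple key every time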

Testing is done with perf from the command line. Scripts are below.
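
Note: perf here refers to the Python benchmarking package (pip install perf). The project has since been renamed to pyperf; with the newer package the same measurements should work by replacing -m perf with -m pyperf, for example:

python -m pyperf timeit -s "d = {}" "d['x'] = 1"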

>>>>>>> nesteddictinsertion
python -m perf timeit -v -s "
from collections import defaultdict
" " 
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
"
>>>>>>> nesteddictlooping
python -m perf timeit -v -s "
from collections import defaultdict
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
" "
for i, inner_dict in d.items():
    for j, val in inner_dict.items():
        i
        j
        val
"
>>>>>>> nesteddictretrieval
python -m perf timeit -v -s "
from collections import defaultdict
d = defaultdict(dict)
for i in range(2000):
    for j in range(1000):
        d[i][j] = 1
" "
for i in range(2000):
    for j in range(1000):
        d[i][j]
"
>>>>>>> keydictinsertion
python -m perf timeit -v -s "
from collections import defaultdict
" " 
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
"
>>>>>>> keydictinsertionprepared
python -m perf timeit -v -s "
from collections import defaultdict
keys = [(i, j) for i in range(2000) for j in range(1000)]
" " 
d = {}
for key in keys:
    d[key] = 1
"
>>>>>>> keydictlooping
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
" "
for key, val in d.items():
    key
    val
"
>>>>>>> keydictretrieval
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
for i in range(2000):
    for j in range(1000):
        d[i, j] = 1
" "
for i in range(2000):
    for j in range(1000):
        d[i, j]
"
>>>>>>> keydictretrievalprepared
python -m perf timeit -v -s "
from collections import defaultdict
d = {}
keys = [(i, j) for i in range(2000) for j in range(1000)]
for key in keys:
    d[key] = 1
" "
for key in keys:
    d[key]
"
Hielke Walinga
2

If you need to use the whole combination of key1 to keyn to get value, you can flip the dict like I suggested below for O(nk*nv) (number of keys * number of values) or use the tuple method above.

Assuming you need to build the tuple on insertion and again when you need to get a value, both will be O(nk) where nk is the number of keys.

The nested dict version might be more space efficient if you nest fairly deeply (lots of values share a partial ordered list of keys), and getting a value will still be O(nk), but probably slower than the tuple version.

However, insertion will be slower, though I can't quantify its speed. You will have to construct at least one layer of dict for each insertion, and test for the existence of dicts at the previous levels.

There are many recipes for recursive defaultdicts that would simplify insertion from a coding perspective, but they wouldn't actually speed things up.
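
One minimal sketch of such a recipe (the same idea appears in the memory-testing answer below):

from collections import defaultdict

def tree():
    return defaultdict(tree)            # every missing key creates another nested level

d = tree()
d['key1']['key2']['key3'] = 'value'     # intermediate dicts are created implicitly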

The tuple method is both simpler and faster for insertion, but may take more memory depending on your nesting.


Original answer from before I knew the requirements

Why not

{'key1' : 'value', 'key2' : 'value', .... 'keyn' : 'value' } 

It's just storing a reference to value at each location, not value itself, so the memory usage would be less than the nested dict version, and not much larger than the tuple version, unless you have an extremely large number of values.

For time complexity of Python standard type operations, see the Python wiki.

Basically, insertion or getting one item on average is O(1).

Getting all the keys for a value would on average be O(n):

[key for key in mydict if mydict[key] == value]

However, if adding keys, or searching for all keys, is the usual operation, your dict should be flipped. You want:

{'value': [key1, key2, ... keyn]}

This way you can add keys with just mydict[value].append(newkey) and get all the keys with just mydict[value] and both will be on average O(1).
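
In practice you'd probably use a defaultdict(list) so the list for a new value doesn't need to be initialized by hand (a minimal sketch with hypothetical keys):

from collections import defaultdict

mydict = defaultdict(list)
mydict['value'].append('key1')      # O(1) average insertion
mydict['value'].append('key2')
print(mydict['value'])              # all keys for 'value', O(1) average lookup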

agf
  • But I need to identify 'value' with the combination of 'key1', 'key2', ... 'keyn'. With this approach I can access the value from each key on its own, without using all the keys together. – Sebastiano Merlino Aug 22 '11 at 13:17
1

Memory Consumption Testing

I've written a small script to test it. It has some limitations though: the keys are made from integers distributed linearly (i.e. range(N)). My findings are the following.

With a 3-level nesting, i.e. dict[a][b][c] vs dict[a,b,c] where each sub index goes from 0 to 99, I find the following:

With large values (list(x for x in range(100))):

> memory.py nested 
Memory usage: 1049.0 MB
> memory.py flat  
Memory usage: 1149.7 MB

and with small values ([]):

> memory.py nested
Memory usage: 134.1 MB
> memory.py flat
Memory usage: 234.8 MB

 Open questions

  • Why is this happening?
  • Would this change with different indices, e.g. non-consecutive ones?

Script

#!/usr/bin/env python3
import resource
import random
import itertools
import sys
import copy
from os.path import basename
from collections import defaultdict
 
# constants
index_levels = [100, 100, 100]
value_size   = 100 # try values like 0

def memory_usage():
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

_object_mold = list(x for x in range(value_size)) # performance hack
def create_object():
    return copy.copy(_object_mold)

# automatically create nested dict
# http://code.activestate.com/recipes/578057-recursive-defaultdict/
f = lambda: defaultdict(f)
my_dict = defaultdict(f)

# options & usage
try:
    dict_mode = sys.argv[1]
    if dict_mode not in ['flat', 'nested']: # ugly hack
        raise ValueError()
except:
    print("Usage: {} [nested | flat]".format(basename(sys.argv[0])))
    exit()
 
index_generator = [range(level) for level in index_levels]

if dict_mode == "flat":
    for index in itertools.product(*index_generator):
        my_dict[index] = create_object()
elif dict_mode == "nested":
    for index in itertools.product(*index_generator):
        sub_dict = my_dict
        for sub_index in index[:-1]:          # iterate into lowest dict
            sub_dict = sub_dict[sub_index]
        sub_dict[index[-1]] = create_object() # finally assign value

print("Memory usage: {:.1f} MB".format(memory_usage() / 1024**2))
Georg Schölly