Suppose I have a multi-level dictionary like this

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

I'd like to access it like this

test = get_entry(mydict, 'first.second.third.fourth')

What I have so far is

def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
       result = dict[key]

    return result

Are there more efficient ways to do it? According to `%timeit`, the runtime of the function is 1.26 µs, while accessing the dictionary the standard way like this

foo = mydict['first']['second']['third']['fourth']

takes 541 ns. I'm looking for ways to trim it to the 800 ns range if possible.

Thanks

cs95
Rick Manix
  • Are all your intermediary dictionaries of length one? If they are, you can use a tuple key fairly efficiently. – jpp Apr 20 '18 at 16:29
  • this throws `KeyError: 'second'` for me – JacobIRR Apr 20 '18 at 16:31
  • @theausome - that answer "...doesn't seem to work on nested dicts." – JacobIRR Apr 20 '18 at 16:37
  • You have to make a few trade-offs if you want to boost performance. What is more likely to change more often - the dictionary you're traversing or the dot notation string you use to traverse? If both are frequently changing and of the same importance you're not going to get much faster than presented in @tdelaney solution. – zwer Apr 25 '18 at 15:46
  • Relevant: https://stackoverflow.com/questions/14692690/access-nested-dictionary-items-via-a-list-of-keys – Chris_Rands Apr 27 '18 at 12:03
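jpp's tuple-key suggestion could look something like this (a sketch, assuming the data can be rebuilt once; `flat` is a hypothetical pre-flattened dict, not something from the question):

```python
mydict = {'first': {'second': {'third': {'fourth': 'the end'}}}}

# Rebuilt once with a tuple key; a single hash lookup replaces the chain.
flat = {('first', 'second', 'third', 'fourth'): 'the end'}

def get_entry(d, keyspec):
    # Still pays for one split per call, but only one dict access afterwards.
    return d[tuple(keyspec.split('.'))]

print(get_entry(flat, 'first.second.third.fourth'))  # the end
```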

7 Answers

I got a 20% performance boost by tightening up the code a bit but a whopping 400% increase by using a cache for split strings. That only makes a difference if you use the same spec multiple times. Here are sample implementations and a profile script to test.

test.py

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

# original
def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
       result = result[key]

    return result

# tighten up code
def get_entry_2(mydict, keyspec):
    for key in keyspec.split('.'):
        mydict = mydict[key]
    return mydict

# use a cache
cache = {}
def get_entry_3(mydict, keyspec):
    global cache
    try:
        spec = cache[keyspec]
    except KeyError:
        spec = tuple(keyspec.split('.'))
        cache[keyspec] = spec

    for key in spec:
        mydict = mydict[key]
    return mydict

if __name__ == "__main__":
    test = get_entry(mydict, 'first.second.third.fourth')
    print(test)

profile.py

from timeit import timeit
print("original get_entry")
print(timeit("get_entry(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry, mydict"))

print("get_entry_2 with tighter code")
print(timeit("get_entry_2(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry_2, mydict"))

print("get_entry_3 with cache of split spec")
print(timeit("get_entry_3(mydict, 'first.second.third.fourth')",
    setup="from test import get_entry_3, mydict"))

print("just splitting a spec")
print(timeit("x.split('.')", setup="x='first.second.third.fourth'"))

The timing on my machine is

original get_entry
4.148535753000033
get_entry_2 with tighter code
3.2986323120003362
get_entry_3 with cache of split spec
1.3073233439990872
just splitting a spec
1.0949148639992927

Notice that splitting the spec is a comparatively expensive operation for this function. That's why caching helps.
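The hand-rolled cache above could also be written with `functools.lru_cache`, which memoizes the split per spec string (a sketch, not benchmarked here; `get_entry_4` is a hypothetical name continuing the numbering):

```python
from functools import lru_cache

mydict = {'first': {'second': {'third': {'fourth': 'the end'}}}}

@lru_cache(maxsize=None)
def split_spec(keyspec):
    # Splitting is the expensive part, so memoize it per spec string.
    return tuple(keyspec.split('.'))

def get_entry_4(mydict, keyspec):
    for key in split_spec(keyspec):
        mydict = mydict[key]
    return mydict

print(get_entry_4(mydict, 'first.second.third.fourth'))  # the end
```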

tdelaney
    Looks like you're the only one who paid attention to performance. – kabanus Apr 20 '18 at 20:38
  • @kabanus I don't get what you mean. You can get nano-second level performance with my solution as long as you pre-process your data once. Whether that can be done or not is on OP, not me. – cs95 Apr 21 '18 at 01:51
  • @COLDSPEED I think the choice between yours and mine is whether lots of queries are done on one dataset or a few queries are done on many datasets. – tdelaney Apr 21 '18 at 02:03
  • Yes, there are tradeoffs :) – cs95 Apr 21 '18 at 02:05
  • @cᴏʟᴅsᴘᴇᴇᴅ yes :) I was biased against you because it seems like cheating, but looking back, I guess I was just jealous. – kabanus Apr 21 '18 at 03:21
  • Thanks, I actually forgot about caching and the tighter loop somehow skipped my mind. Are there any way I can measure the cache reqs in terms of space / bytes needed? I was thinking to pre-cache the keys when loading the dictionaries from file – Rick Manix Apr 21 '18 at 10:25
  • As for memory usage, one trick is to get process size just before creating the cache and again when you are done. See https://stackoverflow.com/questions/938733/total-memory-used-by-python-process for getting size. You can use `sys.getsizeof()` but for a container, you have to enumerate and getsizeof child object also. There is also https://pypi.org/project/memory_profiler/ but I've never used it. – tdelaney Apr 21 '18 at 14:58
  • Your 400% increase is misleading since you only test with a single key, so you're guaranteed to have a cache hit every single time after the first hit. – ilmarinen Apr 28 '18 at 21:13
  • @ilmarinen - Its not misleading. I was clear that this only applies when you use a spec multiple times. If you have more specs you need to use them more often to see the benefit. One split amortized over 10 uses would be the same as two splits over 20 uses. I focused on what mattered. – tdelaney Apr 28 '18 at 22:21
  • @ilmarinen - here is another way to look at it. `timeit` ran a million iterations. 1 with a cache miss and a split, and 999999 with a cache hit. So I'm really just measuring the cache hit case. The 400% increase holds for the second and following use of a given spec. – tdelaney Apr 28 '18 at 22:42

There's really only one solution. Rebuild your dictionary. But do it just once.

def recursive_flatten(mydict):
    d = {}
    for k, v in mydict.items():
        if isinstance(v, dict):
            for k2, v2 in recursive_flatten(v).items():
                d[k + '.' + k2] = v2 
        else:
            d[k] = v
    return d

In [786]: new_dict = recursive_flatten(mydict); new_dict
Out[786]: {'first.second.third.fourth': 'the end'}

(Some more tests)

In [788]: recursive_flatten({'x' : {'y' : 1, 'z' : 2}, 'y' : {'a' : 5}, 'z' : 2})
Out[788]: {'x.y': 1, 'x.z': 2, 'y.a': 5, 'z': 2}

In [789]: recursive_flatten({'x' : 1, 'y' : {'x' : 234}})
Out[789]: {'x': 1, 'y.x': 234}

Every access becomes constant time from here on.

Now, just access your value using new_dict['first.second.third.fourth']. Should work for any arbitrarily nested dictionary that does not contain a self-reference.

Note that every solution has its fair share of tradeoffs, and this is no exception. If you're firing millions of queries at your data, the preprocessing becomes an acceptable one-time overhead, and this is it. The other solutions only sidestep the problem instead of addressing it, which is dealing with the dictionary's structure. OTOH, if you're only going to run a single query against many such similar data structures, it makes no sense to preprocess just for that one query, in which case you may prefer one of the other solutions.

cs95
  • Just a note that this seems to only allow access to the final level of nesting, you wouldn't for example be able to access `new_dict['first.second']` – user3483203 Apr 20 '18 at 16:42
  • @chrisz If needed, that can be fixed by caching `res = recursive_flatten(v)`, updating `d` with `d.update(res)`, and _then_ iterating over `res` in a similar manner. – cs95 Apr 20 '18 at 16:43
  • Using a `dict` directly really is the only fast solution. – kabanus Apr 21 '18 at 03:25
  • Though in terms of space, your (extended in comments) solution would not scale nicely (read linearly). – kabanus Apr 21 '18 at 03:31
  • I believe this could be a good dupe target, but since you placed the bounty, I thought to ask? https://stackoverflow.com/questions/14692690/access-nested-dictionary-items-via-a-list-of-keys – Chris_Rands Apr 27 '18 at 08:27
  • @Chris_Rands I don't believe that's an appropriate target because performance isn't the prime focus there.... – cs95 Apr 27 '18 at 09:17
  • Someone should at least benchmark Martijn's solution, which I find very elegant https://stackoverflow.com/a/14692747/6260170 – Chris_Rands Apr 27 '18 at 09:21
  • I believe you can achieve the multiple level access @chrisz suggests by simply moving `d[k] = v` out of the `else` clause and let it happen regardless. – cdlane Apr 29 '18 at 07:08
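Putting the comment suggestions together, a sketch of a variant (the name `recursive_flatten_all` is hypothetical) that also records every intermediate path, so partial keys work too:

```python
def recursive_flatten_all(mydict):
    # Like recursive_flatten, but also keeps each intermediate level,
    # so flat['first.second'] works as well as the full path.
    d = {}
    for k, v in mydict.items():
        if isinstance(v, dict):
            d[k] = v  # keep the intermediate level itself
            for k2, v2 in recursive_flatten_all(v).items():
                d[k + '.' + k2] = v2
        else:
            d[k] = v
    return d

mydict = {'first': {'second': {'third': {'fourth': 'the end'}}}}
flat = recursive_flatten_all(mydict)
print(flat['first.second.third.fourth'])  # the end
print(flat['first.second'])               # {'third': {'fourth': 'the end'}}
```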

I updated the answer from How to use a dot "." to access members of dictionary? to use an initial conversion which will then work for nested dictionaries:

You can use the following class to allow dot-indexing of dictionaries:

class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

However, this only supports nesting if all nested dictionaries are also of type dotdict. That's where the following helper function comes in:

def dct_to_dotdct(d):
    if isinstance(d, dict):
        d = dotdict({k: dct_to_dotdct(v) for k, v in d.items()})
    return d

This function has to be run once on your nested dictionary, and the result can then be indexed using dot-indexing.

Here are some examples:

In [13]: mydict
Out[13]: {'first': {'second': {'third': {'fourth': 'the end'}}}}

In [14]: mydict = dct_to_dotdct(mydict)

In [15]: mydict.first.second
Out[15]: {'third': {'fourth': 'the end'}}

In [16]: mydict.first.second.third.fourth
Out[16]: 'the end'

A note about performance: this answer is slow compared to standard dictionary access, I just wanted to present an option that actually used "dot access" to a dictionary.

user3483203

Here is a solution similar to chrisz's, but you do not have to do anything to your dict a priori:

class dictDotter(dict):
    def __getattr__(self,key):
        val = self[key]
        return val if type(val) != dict else dictDotter(val)

and just `x = dictDotter(originalDict)` will let you have arbitrary dot getting (`x.first.second...`). I'll note this is twice as slow as chrisz's solution, and his is 9 times as slow as yours (on my machine, approximately).

So, if you insist on making this work, @tdelaney seems to have provided the only real performance improvement.

Another option that does better than what you have (in terms of run time):

class dictObjecter:
    def __init__(self,adict):
        for k,v in adict.items():
            self.__dict__[k] = v
            if type(v) == dict: self.__dict__[k] = dictObjecter(v)

which will make an object out of your dict, so dot notation is just regular attribute access. This cuts the run time to about a third of yours, so not bad, but at the cost of walking over your dict once and replacing it with something else.

Here is the total testing code:

from timeit import timeit

class dictObjecter:
    def __init__(self,adict):
        for k,v in adict.items():
            self.__dict__[k] = v
            if type(v) == dict: self.__dict__[k] = dictObjecter(v)

class dictDotter(dict):
    def __getattr__(self,key):
        val = self[key]
        return val if type(val) != dict else dictDotter(val)

def get_entry(dict, keyspec):
    keys = keyspec.split('.')

    result = dict[keys[0]]
    for key in keys[1:]:
        result = result[key]

    return result

class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

def dct_to_dotdct(d):
    if isinstance(d, dict):
        d = dotdict({k: dct_to_dotdct(v) for k, v in d.items()})
    return d

x = {'a':{'b':{'c':{'d':1}}}}
y = dictDotter(x)
z = dct_to_dotdct(x)
w = dictObjecter(x)
print('{:15} : {}'.format('dict dotter',timeit('y.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('dot dict',timeit('z.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('dict objecter',timeit('w.a.b.c.d',globals=locals(),number=1000)))
print('{:15} : {}'.format('original',timeit("get_entry(x,'a.b.c.d')",globals=locals(),number=1000)))
print('{:15} : {:.20f}'.format('best ref',timeit("x['a']['b']['c']['d']",globals=locals(),number=1000)))

I provided the regular lookup last as a best-case reference. The results on a Windows Ubuntu subsystem:

dict dotter     : 0.0035500000003594323
dot dict        : 0.0017939999997906853
dict objecter   : 0.00021699999979318818
original        : 0.0006629999998040148
best ref        : 0.00007999999979801942

so the objectified dict is about 3 times as slow as a regular dictionary lookup. So if speed is that important, why would you want this?
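For comparison, the standard library's `types.SimpleNamespace` gives the same kind of attribute access as `dictObjecter` without a custom class (a sketch; not benchmarked against the numbers above, and the helper name `to_namespace` is made up):

```python
from types import SimpleNamespace

def to_namespace(d):
    # Recursively convert nested dicts into SimpleNamespace objects,
    # similar in spirit to dictObjecter above.
    return SimpleNamespace(**{k: to_namespace(v) if isinstance(v, dict) else v
                              for k, v in d.items()})

x = {'a': {'b': {'c': {'d': 1}}}}
ns = to_namespace(x)
print(ns.a.b.c.d)  # 1
```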

kabanus
  • No answer here has _actually_ paid attention to performance, including the answer you claimed had. None of these solutions are any good if there are to be millions of accesses - it all adds up. – cs95 Apr 21 '18 at 01:53
  • @cᴏʟᴅsᴘᴇᴇᴅ Hey, at least give me the "nice effort" consideration. I was trying thing that actually need a `.a.b.c.d` to access deeper into the maze. – kabanus Apr 21 '18 at 03:30
  • Okay, you get a "nice effort" consideration from me (+1). I _do_ like your answer, it, like all the other answers, certainly has its merits over mine. – cs95 Apr 21 '18 at 03:50

I had the same need, so I created the Prodict.

For your case, you can do it in one line:

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}
dotdict = Prodict.from_dict(mydict)
print(dotdict.first.second.third.fourth) # "the end"

After that, use dotdict just like a dict, because it is a subclass of dict:

dotdict.first == dotdict['first'] # True

You can also add more keys dynamically with dot notation:

dotdict.new_key = 'hooray'
print(dotdict.new_key) # "hooray"

It works even if the new keys are nested dictionaries:

dotdict.it = {'just': 'works'}
print(dotdict.it.just)  # "works"

Lastly, if you define your keys beforehand, you get auto completion and auto type conversion:

class User(Prodict):
    user_id: int
    name: str

user = User(user_id="1", name="Ramazan")
type(user.user_id) # <class 'int'>
# IDE will be able to auto complete 'user_id' and 'name' properties

UPDATE:

This is the test result for the same code written by @kabanus:

x = {'a': {'b': {'c': {'d': 1}}}}
y = dictDotter(x)
z = dct_to_dotdct(x)
w = dictObjecter(x)
p = Prodict.from_dict(x)

print('{:15} : {}'.format('dict dotter', timeit('y.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('prodict', timeit('p.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('dot dict', timeit('z.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('dict objecter', timeit('w.a.b.c.d', globals=locals(), number=10000)))
print('{:15} : {}'.format('original', timeit("get_entry(x,'a.b.c.d')", globals=locals(), number=10000)))
print('{:15} : {:.20f}'.format('prodict getitem', timeit("p['a']['b']['c']['d']", globals=locals(), number=10000)))
print('{:15} : {:.20f}'.format('best ref', timeit("x['a']['b']['c']['d']", globals=locals(), number=10000)))

And results:

dict dotter     : 0.04535976458466595
prodict         : 0.02860781018446784
dot dict        : 0.019078164088831673
dict objecter   : 0.0017378700050722368
original        : 0.006594238310349346
prodict getitem : 0.00510931794975705289
best ref        : 0.00121740293554022105

As you can see, its performance is between "dict dotter" and "dot dict". Any performance enhancement suggestion will be appreciated.

ramazan polat

The code should be less iterative and more dynamic!!

data

mydict = {
    'first': {
        'second': {
            'third': {
                'fourth': 'the end'
             }
         }
     }
}

Function

def get_entry(dict, keyspec):
    for keys in keyspec.split('.'):
        dict = dict[keys]
    return dict

call the function

res = get_entry(mydict, 'first.second.third.fourth')

this will take less time to execute even though it resolves the keys dynamically!

  • I fail to see how this is remotely different from OP's solution which they didn't want. – cs95 Apr 28 '18 at 10:23
  • As you see there's no use of extra variables to store values that leads it to save time to execute and the time difference is in micro seconds so this will be effective when this code will execute a million times by another code. Moreover you can use first, first.second , first.second.third as an arg without changing single line of code. – Vishal Khichadiya Apr 30 '18 at 04:09
  • The extra variable makes near 0 difference whatsoever, I would certainly hope for larger performance gains than this on a million records. – cs95 Apr 30 '18 at 11:36
  • @cᴏʟᴅsᴘᴇᴇᴅ Can you tell me how much time this code will takes if you measure it really!! Because I'm dmm sure that it's very big difference of time when this code will execute with extra variable and without extra variable. – Vishal Khichadiya Apr 30 '18 at 11:58
  • Not nearly as much as the other answers, we'll go with that. – cs95 Apr 30 '18 at 14:17

You can use reduce (`functools.reduce` in Python 3):

import operator
from functools import reduce  # a builtin in Python 2; needs this import in Python 3

def get_entry(dct, keyspec):
    return reduce(operator.getitem, keyspec.split('.'), dct)

It looks nicer, but performs a little worse.

Your version timeit:

>>> timeit("get_entry_original(mydict, 'first.second.third.fourth')",
           "from __main__ import get_entry_original, mydict", number=1000000)
0.5646841526031494

with reduce:

>>> timeit("get_entry(mydict, 'first.second.third.fourth')",
           "from __main__ import get_entry, mydict")
0.6140949726104736

As tdelaney noticed, the split consumes almost as much CPU time as getting the keys from the dict:

def split_keys(keyspec):
    keys = keyspec.split('.')

timeit("split_keys('first.second.third.fourth')",
       "from __main__ import split_keys")
0.28857898712158203

Just move the string splitting out of the get_entry function:

def get_entry(dct, keyspec_list):
    return reduce(operator.getitem, keyspec_list, dct)


timeit("get_entry(mydict, ['first', 'second', 'third', 'fourth'])",
       "from __main__ import get_entry, mydict")
0.37825703620910645
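If the caller can't pre-split the spec itself, the `reduce` version can borrow tdelaney's caching idea instead (a sketch combining the two answers; `parse_spec` is a made-up helper name):

```python
import operator
from functools import lru_cache, reduce

@lru_cache(maxsize=None)
def parse_spec(keyspec):
    # Memoize the split so repeated specs pay for it only once.
    return tuple(keyspec.split('.'))

def get_entry(dct, keyspec):
    return reduce(operator.getitem, parse_spec(keyspec), dct)

mydict = {'first': {'second': {'third': {'fourth': 'the end'}}}}
print(get_entry(mydict, 'first.second.third.fourth'))  # the end
```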
ndpu