120

I have a dictionary like this:

{
    "id": "abcde",
    "key1": "blah",
    "key2": "blah blah",
    "nestedlist": [
        {
            "id": "qwerty",
            "nestednestedlist": [
                {
                    "id": "xyz",
                    "keyA": "blah blah blah"
                },
                {
                    "id": "fghi",
                    "keyZ": "blah blah blah"
                }
            ],
            "anothernestednestedlist": [
                {
                    "id": "asdf",
                    "keyQ": "blah blah"
                },
                {
                    "id": "yuiop",
                    "keyW": "blah"
                }
            ]
        }
    ]
}

Basically a dictionary with nested lists, dictionaries, and strings, of arbitrary depth.

What is the best way of traversing this to extract the values of every "id" key? I want to achieve the equivalent of an XPath query like "//id". The value of "id" is always a string.

So from my example, the output I need is basically:

["abcde", "qwerty", "xyz", "fghi", "asdf", "yuiop"]

Order is not important.

funnydman
  • 9,083
  • 4
  • 40
  • 55
Matt Swain
  • 3,827
  • 4
  • 25
  • 36
  • **See also:** https://stackoverflow.com/questions/7681301/search-for-a-key-in-a-nested-python-dictionary https://stackoverflow.com/a/16508328/42223 – dreftymac Oct 30 '17 at 19:55
  • Most of your solutions blow up if we pass `None` as input. Do you care about robustness? (since this is now being used as canonical question) – smci May 07 '18 at 05:35

13 Answers13

106

I found this Q/A very interesting, since it provides several different solutions for the same problem. I took all these functions and tested them with a complex dictionary object. I had to take two functions out of the test, because they had to many fail results and they did not support returning lists or dicts as values, which i find essential, since a function should be prepared for almost any data to come.

So i pumped the other functions in 100.000 iterations through the timeit module and output came to following result:

0.11 usec/pass on gen_dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.03 usec/pass on find_all_items(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.15 usec/pass on findkeys(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1.79 usec/pass on get_recursively(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.14 usec/pass on find(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.36 usec/pass on dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

All functions had the same needle to search for ('logging') and the same dictionary object, which is constructed like this:

o = { 'temparature': '50', 
      'logging': {
        'handlers': {
          'console': {
            'formatter': 'simple', 
            'class': 'logging.StreamHandler', 
            'stream': 'ext://sys.stdout', 
            'level': 'DEBUG'
          }
        },
        'loggers': {
          'simpleExample': {
            'handlers': ['console'], 
            'propagate': 'no', 
            'level': 'INFO'
          },
         'root': {
           'handlers': ['console'], 
           'level': 'DEBUG'
         }
       }, 
       'version': '1', 
       'formatters': {
         'simple': {
           'datefmt': "'%Y-%m-%d %H:%M:%S'", 
           'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
         }
       }
     }, 
     'treatment': {'second': 5, 'last': 4, 'first': 4},   
     'treatment_plan': [[4, 5, 4], [4, 5, 4], [5, 5, 5]]
}

All functions delivered the same result, but the time differences are dramatic! The function gen_dict_extract(k,o) is my function adapted from the functions here, actually it is pretty much like the find function from Alfe, with the main difference, that i am checking if the given object has iteritems function, in case strings are passed during recursion:

# python 2
def gen_dict_extract(key, var):
    if hasattr(var,'iteritems'): # hasattr(var,'items') for python 3
        for k, v in var.iteritems(): # var.items() for python 3
            if k == key:
                yield v
            if isinstance(v, dict):
                for result in gen_dict_extract(key, v):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in gen_dict_extract(key, d):
                        yield result

So this variant is the fastest and safest of the functions here. And find_all_items is incredibly slow and far off the second slowest get_recursivley while the rest, except dict_extract, is close to each other. The functions fun and keyHole only work if you are looking for strings.

Interesting learning aspect here :)

mbh86
  • 6,078
  • 3
  • 18
  • 31
hexerei software
  • 3,100
  • 2
  • 15
  • 19
  • This is the right answer for anyone but the OP's narrow case. This has worked on every nasty thing that [jsonpickle](http://jsonpickle.readthedocs.io/en/latest/) can throw at it. – Bruno Bronosky Feb 20 '17 at 04:22
  • 1
    If you want to search for multiple keys as I did, just: (1) change to `gen_dict_extract(keys, var)` (2) put `for key in keys:` as line 2 & indent the rest (3) change the first yield to `yield {key: v}` – Bruno Bronosky Feb 20 '17 at 04:22
  • 7
    You're comparing apples to oranges. Running a function that returns a generator takes less time than running a function that returns a finished result. Try timeit on `next(functionname(k, o)` for all the generator solutions. – kaleissin Jul 18 '17 at 13:10
  • 17
    `hasattr(var, 'items')` for python3 – o-90 Sep 08 '17 at 16:16
  • 1
    Did you consider to strip the `if hasattr` part for a version using `try` to catch the exception in case the call fails (see https://pastebin.com/ZXvVtV0g for a possible implementation)? That would reduce the doubled lookup of the attribute `iteritems` (once for `hasattr()` and once for the call) and thus probably reduce the runtime (which seems important to you). Didn't make any benchmarks, though. – Alfe Nov 09 '17 at 09:56
  • 1
    I have to agree with @kaleissin on this. Those timeit results are absolutely meaningless. It might have been the same dict and the same search term, but generators by their nature do *nothing* until you start iterating through them. If the dict object had 1 entry, or 99 Quadrillion your generator would "finish" in the exact same amount of time, because all it did was find the first result and then it stopped processing. The other solutions, namely "get_recursively", **ACTUALLY** found all occurrences of the key, as the OP had asked. As such, this answer is very misleading. – Stack of Pancakes May 21 '18 at 14:03
  • 1
    I tried `list(gen_dict_extract('logging',o))`, but it returns an empty list, why is this? – xiaoshir Jun 15 '18 at 08:15
  • 5
    For anyone visiting this page now that Python 3 has taken over, remember that `iteritems` has become `items`. – Mike Williamson Jun 19 '18 at 19:10
  • 1
    I do not understand how this function actually solves all cases. Bruno mentions in response to kev's answer that this is the better solution. To me it seems **more** narrow: it does not look through lists (which don't have the `items` or in Py2 `iteritems` attributes). Yet, kev's solution handles this. What am I missing? – Mike Williamson Jun 19 '18 at 19:13
  • @edge27 as Mike Williamson has mentioned, you are most likely using Python 3 and therefore have to use 'iter' instead of 'iteritems' – hexerei software Jun 25 '18 at 14:36
  • @MikeWilliamson Lists are not handled, as they are values in a dictionary and this function only handles returning what ever is set with given key. So it could return a list as value, but searching for a value with given key within a list is not supported – hexerei software Jun 25 '18 at 14:39
  • What is really bad is that you assume the top-level is a dictionary (and not a list), but even worse, you also unnecessarily assume that lists have primitives or dicts as values. Typical case of premature optimisation for speed, when the algorithm doesn't work for all cases (which it easily could) – Anthon Apr 10 '19 at 09:14
  • Are you sure that your function works? - I am using your function and data and it returns empty list. – Outcast Dec 07 '19 at 16:24
  • @PatrickB. totally agree, cleary needs an update after five years, will write an update and add a notebook to allow playing around with as soon as I find the time – hexerei software May 28 '20 at 09:29
  • This example would support lists like ["a":{...}] if added this block at the end: `elif isinstance(var, list): for l in var: for result in gen_dict_extract(key, l): yield result` – Squrppi May 26 '21 at 11:25
  • By the way, you can write yield from gen_dict_extract(key, d), instead of directly iterating over the result – funnydman Jul 11 '22 at 10:24
50
d = { "id" : "abcde",
    "key1" : "blah",
    "key2" : "blah blah",
    "nestedlist" : [ 
    { "id" : "qwerty",
        "nestednestedlist" : [ 
        { "id" : "xyz", "keyA" : "blah blah blah" },
        { "id" : "fghi", "keyZ" : "blah blah blah" }],
        "anothernestednestedlist" : [ 
        { "id" : "asdf", "keyQ" : "blah blah" },
        { "id" : "yuiop", "keyW" : "blah" }] } ] } 


def fun(d):
    if 'id' in d:
        yield d['id']
    for k in d:
        if isinstance(d[k], list):
            for i in d[k]:
                for j in fun(i):
                    yield j

>>> list(fun(d))
['abcde', 'qwerty', 'xyz', 'fghi', 'asdf', 'yuiop']
kev
  • 155,172
  • 47
  • 273
  • 272
  • The only thing I would change is `for k in d` to `for k,value in d.items()` with the subsequent use of `value` instead of `d[k]`. – ovgolovin Mar 21 '12 at 15:49
  • Thanks, this works great. Required very slight modification because my lists can contain strings as well as dicts (which I didn't mention), but otherwise perfect. – Matt Swain Mar 21 '12 at 15:59
  • 1
    This fits a very narrow case, you owe it to yourself to consider the answer from "hexerei software" called `gen_dict_extract` – Bruno Bronosky Feb 20 '17 at 04:24
  • I got the error "TypeError: argument of type 'NoneType' is not iterable" – xiaoshir Jun 15 '18 at 08:11
  • For anyone visiting this page now that Python 3 has taken over, remember that `iteritems` has become `items`. – Mike Williamson Jun 19 '18 at 19:11
  • 3
    This solution doesn't seem to support lists – Alex R Feb 26 '19 at 04:49
  • @AlexR Indeed it doesn't. If you do `d['nestedlist'] = [d['nestedlist']]` you'll get a TypError. This answer incorrectly, and unnecessarily, assumes lists cannot be either at the root level of the data structure and also that they cannot be directly nested. – Anthon Apr 10 '19 at 09:37
  • I think that this solution does not support (nested) lists. – Outcast Dec 07 '19 at 16:23
41
d = { "id" : "abcde",
    "key1" : "blah",
    "key2" : "blah blah",
    "nestedlist" : [
    { "id" : "qwerty",
        "nestednestedlist" : [
        { "id" : "xyz", "keyA" : "blah blah blah" },
        { "id" : "fghi", "keyZ" : "blah blah blah" }],
        "anothernestednestedlist" : [
        { "id" : "asdf", "keyQ" : "blah blah" },
        { "id" : "yuiop", "keyW" : "blah" }] } ] }


def findkeys(node, kv):
    if isinstance(node, list):
        for i in node:
            for x in findkeys(i, kv):
               yield x
    elif isinstance(node, dict):
        if kv in node:
            yield node[kv]
        for j in node.values():
            for x in findkeys(j, kv):
                yield x

print(list(findkeys(d, 'id')))
arainchi
  • 1,352
  • 13
  • 12
  • 3
    This example worked with every complex dictionary I tested. Well done. –  Feb 04 '19 at 03:04
  • 1
    This should be the accepted answer, it can find keys that are within dictionaries that are nested within list of lists etc. – Anthon Apr 10 '19 at 09:10
  • 1
    This works in Python3 as well, as long as the print statement at the end is modified. None of the solutions above this worked for an API response with lists nested inside dicts listed inside lists, etc, but this one worked beautifully. – Sophie Aug 27 '19 at 00:52
  • 1
    This solution is epic! Really delivered for me, thank you! – Leonardo Wildt Mar 15 '23 at 14:24
28
def find(key, value):
  for k, v in value.items():
    if k == key:
      yield v
    elif isinstance(v, dict):
      for result in find(key, v):
        yield result
    elif isinstance(v, list):
      for d in v:
        for result in find(key, d):
          yield result

EDIT: @Anthon noticed that this will not work for directly nested lists. If you have this in your input, you can use this:

def find(key, value):
  for k, v in (value.items() if isinstance(value, dict) else
               enumerate(value) if isinstance(value, list) else []):
    if k == key:
      yield v
    elif isinstance(v, (dict, list)):
      for result in find(key, v):
        yield result

But I think the original version is easier to understand, so I will leave it.

Feline
  • 773
  • 3
  • 14
Alfe
  • 56,346
  • 20
  • 107
  • 159
  • You can strip the dict-elif-branch if you like. your case doesn't seem to have these. – Alfe Mar 21 '12 at 15:50
  • 1
    This works great as well, but likewise runs into issues if it encounters a list that directly contains a string (which I forgot to include in my example). I think adding in an `isinstance` check for a `dict` before the last two lines solves this. – Matt Swain Mar 21 '12 at 16:15
  • @Alfe congratulations! see my answer, where i speed tested all functions, your function is outstanding, only overtaken by my adaption to your function, where i checked if given object has `iteritems()` function – hexerei software Apr 15 '15 at 14:50
  • 1
    Thanks for the accolades, but I'd be prouder to get them for the cleanliness of my code than for its speed. – Alfe Apr 16 '15 at 08:10
  • 1
    95% of the time, yes. The remaining (rare) occasions are the ones in which some time limitation might force me to choose a faster version over a cleaner one. But I don't like this. It always means to put a load of work onto my successor who will have to maintain that code. It is a risk because my successor might get confused. I will have to write a lot of comments then, maybe a whole document explaining my motivations, timing experiments, their results etc. That's way more work for me and all colleagues to get it done properly. Cleaner is way simpler. – Alfe Dec 05 '16 at 14:37
  • 2
    @Alfe - thanks for this answer. I had a need to extract all occurences of a string in a nested dict for a specific use case of Elasticsearch and this code was useful with a minor modification - https://stackoverflow.com/questions/40586020/get-the-number-of-fields-on-an-index/51692012#51692012 – Saurabh Hirani Aug 05 '18 at 06:48
  • 1
    This completely **breaks** on lists directly contained in lists. – Anthon Apr 10 '19 at 09:41
  • @Anthon You are right. I could change the implementation to also support nested lists but then it gets more complex and harder to understand, and the OP's case didn't have those anyway. But thanks for point that out anyway. – Alfe Apr 10 '19 at 13:02
  • I am not sure why you think it gets **more** complex, as in fact the character count would decrease by at least 5% and your line-count would stay the same for that function. It is just re-arranging of the lines you have, indenting accordingly, and changing one `elif` to `if`. – Anthon Apr 10 '19 at 13:26
  • I added another version solving your issue of directly nested lists. But I still believe the original version is easier to understand. – Alfe Apr 10 '19 at 15:31
  • I think that your second version indeed works for nested lists - well done ;) – Outcast Dec 07 '19 at 16:29
17

pip install nested-lookup does exactly what you are looking for:

document = [ { 'taco' : 42 } , { 'salsa' : [ { 'burrito' : { 'taco' : 69 } } ] } ]

>>> print(nested_lookup('taco', document))
[42, 69]
VengaVenga
  • 680
  • 1
  • 10
  • 13
  • 3
    Works the same as @arainchi answer above, plus there's a bunch of other functions in [nested-lookup](https://pypi.org/project/nested-lookup/) library. For the above example use `from nested_lookup import nested_lookup` – MiKK Jan 03 '22 at 22:44
9

I just wanted to iterate on @hexerei-software's excellent answer using yield from and accepting top-level lists.

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)
Chris
  • 531
  • 5
  • 11
  • Excellent mod to @hexerei-software's answer: succinct and allows list-of-dicts! I'm using this along with @bruno-bronosky's suggestions in his comments to use `for key in keys`. Also I added to the 2nd `isinstance` to `(list, tuple)` for even *more* variety. ;) – Cometsong Feb 15 '19 at 14:44
  • From a suggested edit: "Note: this appears to be about 50 times slower than the other gen_dict_extract" – pppery Nov 05 '22 at 16:42
6

This function recursively searches a dictionary containing nested dictionaries and lists. It builds a list called fields_found, which contains the value for every time the field is found. The 'field' is the key I'm looking for in the dictionary and its nested lists and dictionaries.

def get_recursively(search_dict, field):
    """Takes a dict with nested lists and dicts,
    and searches all dicts for a key of the field
    provided.
    """
    fields_found = []

    for key, value in search_dict.iteritems():

        if key == field:
            fields_found.append(value)

        elif isinstance(value, dict):
            results = get_recursively(value, field)
            for result in results:
                fields_found.append(result)

        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    more_results = get_recursively(item, field)
                    for another_result in more_results:
                        fields_found.append(another_result)

    return fields_found
Pat Myron
  • 4,437
  • 2
  • 20
  • 39
Becca Petrin
  • 1,494
  • 14
  • 13
  • 2
    You could use fields_found.extend(more_results) instead of running another loop. Would look a bit cleaner in my opinion. – sapit Jun 15 '18 at 10:47
6

Another variation, which includes the nested path to the found results (note: this version doesn't consider lists):

def find_all_items(obj, key, keys=None):
    """
    Example of use:
    d = {'a': 1, 'b': 2, 'c': {'a': 3, 'd': 4, 'e': {'a': 9, 'b': 3}, 'j': {'c': 4}}}
    for k, v in find_all_items(d, 'a'):
        print "* {} = {} *".format('->'.join(k), v)    
    """
    ret = []
    if not keys:
        keys = []
    if key in obj:
        out_keys = keys + [key]
        ret.append((out_keys, obj[key]))
    for k, v in obj.items():
        if isinstance(v, dict):
            found_items = find_all_items(v, key, keys=(keys+[k]))
            ret += found_items
    return ret
Dolan Antenucci
  • 15,432
  • 17
  • 74
  • 100
1

I was not able to get the solutions posted here to work out-of-the-box so I wanted to write something that is a little more flexible.

The recursive function below that should allow you to collect all values that meet some regex pattern for a given key in an arbitrarily deeply set of nested dictionaries and lists.

import re

def search(dictionary, search_pattern, output=None):
    """
    Search nested dictionaries and lists using a regex search
    pattern to match a key and return the corresponding value(s).
    """
    if output is None:
        output = []

    pattern = re.compile(search_pattern)
    
    for k, v in dictionary.items():
        pattern_found = pattern.search(k)
        if not pattern_found:
            if isinstance(v, list):
                for item in v:
                    if isinstance(item, dict):
                        search(item, search_pattern, output)
            if isinstance(v, dict):
                search(v, search_pattern, output)
        else:
            if pattern_found:
                output.append(v)

    return output

If you want to search for a specific term, you can always make your search pattern something like r'\bsome_term\b'.

Justin
  • 707
  • 7
  • 13
  • Fresh approach but does not work with nested such as mylist = { 'sad': [[1,1.2],[2,1.3],[3,1.4]], 'puppy': [[2,2.2],[3,2.3]], 'happy':[[3,4.5],[5,6.7,{'happy':[1,2,3]}]]} – MiKK Jan 03 '22 at 22:36
0

Here is my stab at it:

def keyHole(k2b,o):
  # print "Checking for %s in "%k2b,o
  if isinstance(o, dict):
    for k, v in o.iteritems():
      if k == k2b and not hasattr(v, '__iter__'): yield v
      else:
        for r in  keyHole(k2b,v): yield r
  elif hasattr(o, '__iter__'):
    for r in [ keyHole(k2b,i) for i in o ]:
      for r2 in r: yield r2
  return

Ex.:

>>> findMe = {'Me':{'a':2,'Me':'bop'},'z':{'Me':4}}
>>> keyHole('Me',findMe)
<generator object keyHole at 0x105eccb90>
>>> [ x for x in keyHole('Me',findMe) ]
['bop', 4]
AXE Labs
  • 4,051
  • 4
  • 29
  • 29
0

Following up on @hexerei software's answer and @bruno-bronosky's comment, if you want to iterate over a list/set of keys:

def gen_dict_extract(var, keys):
   for key in keys:
      if hasattr(var, 'items'):
         for k, v in var.items():
            if k == key:
               yield v
            if isinstance(v, dict):
               for result in gen_dict_extract([key], v):
                  yield result
            elif isinstance(v, list):
               for d in v:
                  for result in gen_dict_extract([key], d):
                     yield result    

Note that I'm passing a list with a single element ([key]}, instead of the string key.

0

Since there's a maximum recursion depth in Python I would consider implementing an iterative approach for arbitrary size:

def get_ids(data: Dict, key: str) -> List:
    stack = [data]
    result = []
    while stack:
        elem = stack.pop()
        if isinstance(elem, dict):
            for k, v in elem.items():
                if k == key:
                    result.append(v)
                if isinstance(elem, (list, dict)):
                    stack.append(v)
        elif isinstance(elem, list):
            for obj in elem:
                stack.append(obj)
    return result
funnydman
  • 9,083
  • 4
  • 40
  • 55
0

glom is a great library for searching and restructuring that can also do nested lookups using globs. Example:

In [1]: import glom

In [2]: data = { "id" : "abcde", "key1" : "blah", ... }  # OP example

In [3]: glom.glom(data, '**.id')
Out[3]: ['abcde', 'qwerty', 'xyz', 'fghi', 'asdf', 'yuiop']

The nesting levels are separated by dots (think of them as slashes in Unix globs), single star is a placeholder for one level, double star is a placeholder for multiple levels. In the above example, **.id means "id key on any level". More examples:

In [4]: glom.glom(d, ('*.*.id', glom.Flatten()))
Out[4]: ['qwerty']

This will go through all keys on third level only and extract the values for the id, whereas empty results are discarded (the nested results list is flattened).

Another example that collects id values specifically in the anothernestednestedlist list only:

In [5]: glom.glom(d, ('nestedlist', [('anothernestednestedlist', ['id'])]))
Out[5]: [['asdf', 'yuiop']]

However, the library can do a lot more than just that. Read the glom docs for more features.

hoefling
  • 59,418
  • 12
  • 147
  • 194