How to filter by keys through a nested dictionary in a pythonic way

Question

I am trying to filter a nested dictionary. My solution is clunky, and I was hoping there is a better method, perhaps something using comprehensions. I am only interested in dictionaries and lists for this example.

_dict_key_filter() will filter the keys of a nested dictionary or a list of nested dictionaries. Anything not in the obj_filter will be ignored on all nested levels.

obj : can be a dictionary or a list of dictionaries.

obj_filter: has to be a list of filter values

def _dict_key_filter(self, obj, obj_filter):
    if isinstance(obj, dict):
        retdict = {}
        for key, value in obj.items():
            if key in obj_filter:
                retdict[key] = copy.deepcopy(value)
            elif isinstance(value, (dict, list)):
                child = self._dict_key_filter(value, obj_filter)
                if child:
                    retdict[key] = child
        return retdict if retdict else None
    elif isinstance(obj, list):
        retlist = []
        for value in obj:
            child = self._dict_key_filter(value, obj_filter)
            if child:
                retlist.append(child)
        return retlist if retlist else None
    else:
        return None

Example
dict1 = {'test1': {'test2':[1,2]}, 'test3': [{'test6': 2}, 
         {'test8': {'test9': 23}}], 'test4':{'test5': 5}}

obj_filter = ['test5', 'test9']

result = _dict_key_filter(dict1, obj_filter)

The return value would be {'test3': [{'test8': {'test9': 23}}], 'test4': {'test5': 5}}

Can you amend your question with a specification of what `_dict_key_filter` is supposed to do and what parameters it takes? For instance, I would have guessed `obj_filter` was a callable but apparently it is a sequence of keys that are acceptable? — Two-Bit Alchemist, Jul 29 '15 at 20:18
It's not really clear because you're still using just the word "filter" without defining it. By what mechanism is the filter meant to work? Filter anything not appearing in `obj_filter`? At any level? — Two-Bit Alchemist, Jul 29 '15 at 20:54
I have updated the post. obj_filter is used to compare against the nested dictionary, any key from the lowest level node that is not in the obj_filter will be removed. Please see the example on the bottom. — user1539348, Jul 29 '15 at 21:08

score 2 · Accepted Answer · answered Jan 09 '19 at 19:29

It's a really old question. I came across a similar problem recently.

It may be obvious, but you are dealing with a tree in which each node has an arbitrary number of children. You want to cut the subtrees that do not contain certain items as nodes (not leaves). To achieve this, you are using a custom DFS: the main function returns either a subtree or None. If the value is None, you "cut" the branch.

First of all, the function dict_key_filter returns a (non-empty) dict, a (non-empty) list, or None if no filter key was found in the branch. To reduce complexity, you could return a container in every case: an empty one if no filter key was found, and a non-empty one if you are still searching or you found the leaf of the tree. Your code would look like:

def dict_key_filter(obj, obj_filter):
    if isinstance(obj, dict):
        retdict = {}
        ...
        return retdict # empty or not
    elif isinstance(obj, list):
        retlist = []
        ...
        return retlist # empty or not
    else:
        return [] # obviously empty

This was the easy part. Now we have to fill in the dots.

The `list` case

Let's begin with the list case, since it is the easier one to refactor:

retlist = []
for value in obj:
    child = dict_key_filter(value, obj_filter)
    if child:
        retlist.append(child)

We can translate this into a simple list comprehension:

retlist = [dict_key_filter(value, obj_filter) for value in obj if dict_key_filter(value, obj_filter)]

The drawback is that dict_key_filter is evaluated twice. We can avoid this with a little trick (see https://stackoverflow.com/a/15812866):

retlist = [subtree for subtree in (dict_key_filter(value, obj_filter) for value in obj) if subtree]
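A quick way to check the difference between the two comprehensions is to count calls. This is only a sketch: `noisy_filter` and the `calls` counter are hypothetical stand-ins for dict_key_filter, not part of the code in this answer.

```python
calls = 0

def noisy_filter(value):
    """Hypothetical stand-in for dict_key_filter that counts its calls."""
    global calls
    calls += 1
    # Odd values "survive" as a non-empty list, even values give an empty one
    return [value] if value % 2 else []

values = [1, 2, 3, 4]

# Naive comprehension: called in the condition, then again for each kept item
calls = 0
naive = [noisy_filter(v) for v in values if noisy_filter(v)]
naive_calls = calls

# Nested generator: called exactly once per value
calls = 0
nested = [sub for sub in (noisy_filter(v) for v in values) if sub]
nested_calls = calls

print(naive == nested)            # True
print(naive_calls, nested_calls)  # 6 4
```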

The inner expression (dict_key_filter(value, obj_filter) for value in obj) is a generator that calls dict_key_filter once per value. But we can do even better if we build a closure around dict_key_filter:

def dict_key_filter(obj, obj_filter):
    def inner_dict_key_filter(obj): return dict_key_filter(obj, obj_filter)

    ...

    retlist = list(filter(len, map(inner_dict_key_filter, obj)))

Now we are in the functional world: map applies inner_dict_key_filter to every element of the list and then the subtrees are filtered to exclude empty subtrees (len(subtree) is true iff subtree is not empty). Now, the code looks like:

def dict_key_filter(obj, obj_filter):
    def inner_dict_key_filter(obj): return dict_key_filter(obj, obj_filter)

    if isinstance(obj, dict):
        retdict = {}
        ...
        return retdict
    elif isinstance(obj, list):
        return list(filter(len, map(inner_dict_key_filter, obj)))
    else:
        return []

If you are familiar with functional programming, the list case is readable (not quite as readable as it would be in Haskell, but still readable).
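If you are less familiar with it, here is the `filter(len, ...)` step in isolation. The `subtrees` list is made up for illustration; it stands for what the mapped calls might return for four list elements.

```python
# Made-up results, as dict_key_filter might return them for four elements
subtrees = [[], {'test9': 23}, {}, [{'test5': 5}]]

# len(subtree) is 0 (falsy) for empty containers, so filter drops them
kept = list(filter(len, subtrees))
print(kept)  # [{'test9': 23}, [{'test5': 5}]]
```

Note that `len` raises a TypeError on a non-container such as an int. That cannot happen in the list case, because dict_key_filter always returns a dict or a list.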

The `dict` case

I have not forgotten the dictionary-comprehension tag on your question. The first idea is to create a function that returns either a full copy of the branch or the result of the rest of the DFS:

def build_subtree(key, value):
    if key in obj_filter:
        return copy.deepcopy(value) # keep the branch
    elif isinstance(value, (dict, list)):
        return inner_dict_key_filter(value) # continue to search
    return [] # just an orphan value here

As in the list case, we do not reject empty subtrees for now:

retdict = {}
for key, value in obj.items():
    retdict[key] = build_subtree(key, value)

We now have a perfect case for a dict comprehension:

retdict = {key: build_subtree(key, value) for key, value in obj.items() if build_subtree(key, value)}

Again, we use the little trick to avoid computing a value twice:

retdict = {key:subtree for key, subtree in ((key, build_subtree(key, value)) for key, value in obj.items()) if subtree}
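On Python 3.8 or later, an assignment expression (`:=`) is another way to compute the value only once. This is an alternative spelling, not what the rest of this answer uses, and `build_subtree` below is a simplified hypothetical stand-in so that the snippet runs on its own.

```python
def build_subtree(key, value):
    """Simplified stand-in: keep values whose key starts with 'keep'."""
    return value if key.startswith('keep') else []

obj = {'keep_a': [1, 2], 'drop_b': [3], 'keep_c': {'x': 1}}

# The condition runs first and binds subtree; the value expression reuses it
retdict = {key: subtree for key, value in obj.items()
           if (subtree := build_subtree(key, value))}
print(retdict)  # {'keep_a': [1, 2], 'keep_c': {'x': 1}}
```

Like the nested-generator version, this filters on plain truthiness, so falsy values such as 0 are dropped too.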

But we have a little problem here: the dict comprehension is not exactly equivalent to the original code. What if the value is 0? In the original version, we have retdict[key] = copy.deepcopy(0), but in the new version we have nothing. The 0 value evaluates as false and is filtered out. The dict may then become empty, and we wrongly cut the branch. We need another test to be sure we want to remove a value: if it is an empty list or dict, remove it; otherwise keep it:

def to_keep(subtree): return not (isinstance(subtree, (dict, list)) and len(subtree) == 0)

That is:

def to_keep(subtree): return not isinstance(subtree, (dict, list)) or subtree

If you remember a bit of logic (https://en.wikipedia.org/wiki/Truth_table#Logical_implication) you can interpret this as: if subtree is a dict or a list, then it must not be empty.
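As a small check, this sketch compares `to_keep` with plain truthiness on a few made-up candidate values:

```python
def to_keep(subtree):
    # Non-containers are always kept; dicts and lists only when non-empty
    return not isinstance(subtree, (dict, list)) or subtree

candidates = [0, '', 23, [], {}, [0], {'test5': 0}]

print([bool(to_keep(c)) for c in candidates])
# [True, True, True, False, False, True, True]
print([bool(c) for c in candidates])
# [False, False, True, False, False, True, True]
```

The two tests differ only on falsy non-containers such as 0 and '', which is exactly the case that plain truthiness gets wrong.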

Let's put the pieces together:

import copy

def dict_key_filter(obj, obj_filter):
    def inner_dict_key_filter(obj): return dict_key_filter(obj, obj_filter)
    def to_keep(subtree): return not isinstance(subtree, (dict, list)) or subtree

    def build_subtree(key, value):
        if key in obj_filter:
            return copy.deepcopy(value) # keep the branch
        elif isinstance(value, (dict, list)):
            return inner_dict_key_filter(value) # continue to search
        return [] # just an orphan value here

    if isinstance(obj, dict):
        key_subtree_pairs = ((key, build_subtree(key, value)) for key, value in obj.items())
        return {key:subtree for key, subtree in key_subtree_pairs if to_keep(subtree)}
    elif isinstance(obj, list):
        return list(filter(to_keep, map(inner_dict_key_filter, obj)))
    return []

I don't know if this is more pythonic, but it seems clearer to me.

dict1 = {
    'test1': { 'test2':[1,2] }, 
    'test3': [
        {'test6': 2}, 
        {
            'test8': { 'test9': 23 }
        }
    ],
    'test4':{'test5': 0}
}

obj_filter = ['test5' , 'test9']

print(dict_key_filter(dict1, obj_filter))
# {'test3': [{'test8': {'test9': 23}}], 'test4': {'test5': 0}}
