3

I'm trying to traverse a dictionary (which has many strings, dicts, lists of dicts), and compare it against another dictionary.

Here's an example:

data = {
  "topic": "Seniors' Health Care Freedom Act of 2007",
  "foo": "bar",
  "last_update": "2011-08-29T20:47:44Z",
  "organisations": [
    {
      "organization_id": "22973",
      "name": "National Health Federation",
      "bar": "baz"
    },
    {
      "organization_id": "27059",
      "name": "A Christian Perspective on Health Issues"
    },
]}

validate = {
  "topic": None,
  "last_update": "next_update",
  "organisations": [
      {
        "organization_id": None,
        "name": None
      }
    ]
}

Essentially, if the item exists in "data", but not in "validate" at the current point, it should be deleted from data.

So in this case, I'd want data["foo"] and data["organisations"][x]["bar"] to be removed from the data dict.

Additionally, if the key in validate has a string value and isn't "None", I want to update the key name in data to that, i.e. "last_update" should become "next_update".

I'm not sure of a good way to do this in Python, my current version removes "foo" but I'm struggling trying to remove nested keys like organisations[x][bar].

This is my current attempt:

def func1(data, validate, parent = None):
  for k, v in sorted(data.items()):
    if not parent:
      if k not in validate:
        data.pop(k, None)

    if isinstance(v, dict):
        func1(v, validate)
    elif isinstance(v, list):
      for val in v:
          func1(val, validate, parent = k)

func1(data, validate)

I tried to use something like this to compare the keys instead but figured it doesn't work well if data has additional keys (appeared to remove wrong keys) since dicts are unsorted so wasn't useful for me:

for (k, v), (k2, v2) in zip(sorted(data.items()), sorted(validate.items())):

I've read similar posts such as How to recursively remove certain keys from a multi-dimensional(depth not known) python dictionary?, but this seems to use a flat set to filter so it doesn't take into account where in the dict the key is located which is important for me - as "last_update" can appear in other lists where I need to keep it.

גלעד ברקן
  • 23,602
  • 3
  • 25
  • 61

3 Answers3

1

Here is a simple recursive function. Well, it used to be simple; and then I added tons of checks and now it's an if forest.

def validate_the_data(data, validate):
  for key in list(data.keys()):
    if key not in validate:
      del data[key]
    elif validate[key] is not None:
      if isinstance(data[key], dict):
        validate_the_data(data[key], validate[key])
      elif isinstance(data[key], list):
        for subdata, subvalidate in zip(data[key], validate[key]):
          if isinstance(subdata, dict) and isinstance(subvalidate, dict):
            validate_the_data(subdata, subvalidate)
      else:
        data[key] = validate[key]

How it works: if data[key] is a dictionary and key is valid, then we want to check the keys in data[key] against the keys in validate[key]. So we do a recursive call, but instead of putting validate in the recursive call, we put validate[key]. Likewise if data[key] is a list.

Assumptions: The above code will fail if one of the list in data contains elements which are not dictionaries, or if data[key] is a dictionary when validate[key] exists but isn't a dictionary or None, or if data[key] is a list when validate[key] exists but isn't a list or None.

Important note about the if forest: The order of the if/else/if/elif/else matters. In particular, we only execute data[key] = validate[key] in the case where we don't have a list. If validate[key] is a list, then data[key] = validate[key] would result in data[key] becoming the same list, and not a copy of the list, which is most certainly not what you want.

Important note about list(data.keys()): I used the iteration for key in list(data.keys()): and not for key in data: or for key, value in data:. Normally this would not be the preferred way of iterating over a dict. But we use del inside the for loop to remove values from the dictionary, which would interfere with the iteration. So we need to get the list of keys before deleting any element, and then use that list to iterate.

Stef
  • 13,242
  • 2
  • 17
  • 28
  • Wouldn't this approach fail if 'subdata' is not a dictionary? You might want to add a type check before the last line. – jeremyr Oct 02 '20 at 18:02
  • @jeremyr hum, yes, it would fail. I also only check whether `data[key]` is a dictionary or a list; it would be more secure to also check that `validate[key]` is accordingly a dictionary or a key. – Stef Oct 02 '20 at 18:04
  • This is fixed now. Thanks @jeremyr. – Stef Oct 02 '20 at 18:22
  • Couldn't resist, had a stab at it myself :) – jeremyr Oct 02 '20 at 19:57
0

Interesting problem! To prevent multitude of if...else..., you would need to to find an approach which allows recursion regardless of the type of incoming values.

So I presume you need the following rules:

  1. If any value from data is None in validate, value in data should be preserved
  2. If values from data and validate are dictionaries, keep only keys from data if also present in validate, and apply these rules recursively to other keys.
  3. If values from data and validate are lists, keep only items from data if also present in validate, and apply these rules recursively to other items.
  4. If any value from data is not None in validate and rule (2) and (3) don't apply, value in data should be replaced by value in validate

Here is my suggestion:

def sanitize(data1, data2):
    """Sanitize *data1* depending on *data2*
    """
    # If value2 is None, simply return value1
    if data2 is None:
        return data1

    # Update value1 recursively if both values are dictionaries.
    elif isinstance(data1, dict) and isinstance(data2, dict):
        return {
            key: sanitize(_value, data2.get(key))
            for key, _value in data1.items()
            if key in data2
        }

    # Update value1 recursively if both values are lists.
    elif isinstance(data1, list) and isinstance(data2, list):
        return [
            sanitize(subvalue1, subvalue2)
            for subvalue1, subvalue2
            in zip(data1, data2)
        ]

    # Otherwise, simply return value2.
    return data2

Using your values, you'd get the following output:

> sanitize(data, validate)
{
    'topic': "Seniors' Health Care Freedom Act of 2007",
    'last_update': 'next_update',
    'organisations': [
        {
            'organization_id': '22973',
            'name': 'National Health Federation'
        }
    ]
}

From rule 3, I presumed that you want to delete all list items from data if not present in validate, hence the removal of the second items from "organisations".

It rule 3 should rather be:

  1. If values from data and validate are lists, apply these rules recursively to other items.

Then you can simply replace the zip function by itertools.zip_longest

jeremyr
  • 425
  • 4
  • 12
0

Dictionary and list comprehensions make quick work of the problem -

def from_schema(t, s):
  if isinstance(t, dict) and isinstance(s, dict):
    return { v if isinstance(v, str) else k: from_schema(t[k], v) for (k, v) in s.items() if k in t }
  elif isinstance(t, list) and isinstance(s, list):
    return [ from_schema(v, s[0]) for v in t if s ]
  else:
    return t

A few line breaks might make the comprehensions more... comprehensible -

def from_schema(t, s):
  if isinstance(t, dict) and isinstance(s, dict):
    return \
      { v if isinstance(v, str) else k: from_schema(t[k], v)
          for (k, v) in s.items()
            if k in t
      }
  elif isinstance(t, list) and isinstance(s, list):
    return \
      [ from_schema(v, s[0])
          for v in t
            if s 
      ]
  else:
    return t
result = from_schema(data, validate)
print(result)
{
    "topic": "Seniors' Health Care Freedom Act of 2007",
    "next_update": "2011-08-29T20:47:44Z",
    "organisations": [
        {
            "organization_id": "22973",
            "name": "National Health Federation"
        },
        {
            "organization_id": "27059",
            "name": "A Christian Perspective on Health Issues"
        }
    ]
}
Mulan
  • 129,518
  • 31
  • 228
  • 259
  • Can you explain `[ from_schema(v, s[0]) for v in t if s ]`? Why are you testing against `s` inside the list comprehension? – Stef Oct 03 '20 at 07:29
  • @Stef very good question. The risk is that `s[0]` could raise an `IndexError` if the schema, `s`, is empty. By adding an `if s` condition to the comprehension, we avoid this pitfall. – Mulan Oct 03 '20 at 17:14
  • Why did you prefer that form over `[from_schema(v, s[0]) for v in t] if s else []` which only has one `if` for the whole list rather than one `if` per element? – Stef Oct 03 '20 at 19:19
  • That is also good, I only chose to do it the way I did because of consistency. Both code branches handling compound data simply return a new dict or list using a comprehension. Changing one to use an "outside" `if` breaks the consistency a little bit. If you mean to suggest performance will be better, I think this is a micro-optimisation. Especially when you consider the `if s` is a static (unchanging) condition and the compiled program is probably optimised w branch prediction. – Mulan Oct 03 '20 at 20:47