Filtering a JSON array (with no root node name) based on multiple nested conditions in Python

Question

I'm trying to filter a JSON array using Python based on multiple conditions. My JSON looks similar to this (there is no root name):

     {           
       "id": "123455",           
       "outter": {
          "inner": [
            {
              "nox": "abc:6666",
              "code": "1329",        
            }
           ],    
        },
        "topic": {
         "reference": "excel"
        }, 
        "date1": "1990-07-28T03:52:44-04:00",
        "finalDate": "1990-07-28T03:52:44-04:00"
      }
      {           
       "id": "123435",           
       "outter": {
          "inner": [
            {
              "nox": "abc:6666",
              "code": "9351",        
            }
           ],    
        },
        "topic": {
         "reference": "excel"
        }, 
        "date1": "1990-07-28T03:52:44-04:00",
        "finalDate": "1995-07-28T03:52:44-04:00"
      }

My goal is to filter based on 2 conditions and return all that match them.

1: outter --> inner --> code = 9351 AND

2: finalDate >= 1995

So far I can perform each check separately with no problem, using the following code:

   data = pd.read_json('myFile.ndjson', lines = True)

   for item in data['outter']:
      for x in item['inner']:
         if(x['code'] == '9351'):
            found....

But I'm not sure how to check both at the same time: I have to start the loop with either data['outter'] or data['finalDate'], and inside the loop I can only see that element of the array, not the complete array.

Any help is appreciated, thanks!

You have a minor typo. Since it looks like you have a JSON array, you can wrap the input JSON in square brackets `[]` and separate the elements with commas. Coincidentally, the result can then be used as a Python `list` object as is. — rv.kvetch, Sep 21 '21 at 02:53

rv.kvetch · Answer 1 · 2021-09-21T04:28:08.177

Here's one solution that can filter the list as mentioned. I'm using list comprehensions instead of loops and there's probably some stuff in there that could be improved, but the result seems to at least be as expected.

Note: This uses the walrus operator `:=`, which was introduced in Python 3.8. If you're running an earlier Python version, you can probably remove the code that uses it, but I haven't bothered with that here.

from pprint import pprint


data_list = [
    {
        "id": "123455",
        "outter": {
            "inner": [
                {
                    "nox": "abc:6666",
                    "code": "1329",
                }
            ],
        },
        "topic": {
            "reference": "excel"
        },
        "date1": "1990-07-28T03:52:44-04:00",
        "finalDate": "1990-07-28T03:52:44-04:00"
    },
    {
        "id": "123435",
        "outter": {
            "inner": [
                {
                    "nox": "abc:6666",
                    "code": "9351",
                }
            ],
        },
        "topic": {
            "reference": "excel"
        },
        "date1": "1990-07-28T03:52:44-04:00",
        "finalDate": "1995-07-28T03:52:44-04:00"
    }
]

result = [d for d in data_list
          if (year := d['finalDate'][:4]).isnumeric() and year >= '1995'
          and any(str(inner['code']) == '9351'
                  for inner in d['outter']['inner'] or [])]

pprint(result)
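
For Python versions before 3.8, the same comprehension can probably be written without the walrus operator by slicing `finalDate` twice. This is only a sketch: the sample below is trimmed to the fields the filter actually reads, and it assumes `finalDate` always starts with a four-digit year.

```python
# Trimmed sample with only the fields the filter reads (illustrative data)
data_list = [
    {"id": "123455", "outter": {"inner": [{"code": "1329"}]},
     "finalDate": "1990-07-28T03:52:44-04:00"},
    {"id": "123435", "outter": {"inner": [{"code": "9351"}]},
     "finalDate": "1995-07-28T03:52:44-04:00"},
]

# Same two conditions as above, with the year slice repeated
# instead of being bound with `:=`
result = [d for d in data_list
          if d['finalDate'][:4].isnumeric() and d['finalDate'][:4] >= '1995'
          and any(str(inner['code']) == '9351'
                  for inner in d['outter']['inner'] or [])]

print([d['id'] for d in result])
```

Repeating the slice costs a little extra work per record, which should not matter for data of this size.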

@ted made a good point that readability counts, so I took some time to go back and write it as a typical loop (essentially the same logic as above). I also added a lot of comments to clarify what's going on in the code. I hope you find it helpful :-)

from pprint import pprint
from typing import Dict, List

result = []

# Looping over each dictionary in list
for d in data_list:
    # Grab the year part, first four characters of `finalDate`
    year = d['finalDate'][:4]
    # isnumeric() confirms the year part contains only digits.
    # Then compare the strings: for four-digit years, lexicographic
    # (character-by-character) order matches numeric order, so this
    # checks that the year is 1995 or later.
    valid_year = year.isnumeric() and year >= '1995'
    # Simple, if it doesn't match our desired year then continue
    # with next loop iteration
    if not valid_year:
        continue
    # Get inner list, then loop over it
    inner_list: List[Dict] = d['outter']['inner']
    for inner in inner_list:
        if inner['code'] == '9351':
            # We found an inner with our desired code!
            break
    # The for-else syntax is pretty self-explanatory. This `else` statement is
    # only run when we don't `break` out of the loop above.
    else:
        # No break statement was run, so the desired code did not match
        # any elements. Again, continue with the next iteration.
        continue
    # At this point, we know it's valid since it matched both our
    # conditions, so we add it to the result list.
    result.append(d)

# Print the result; it should be the same as before
pprint(result)

While this may solve the issue, I believe a fancy one-liner list comprehension with multiple inline conditions and walrus operators might not be a very clear solution for the person asking the question to understand and improve. "Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts." — ted, Sep 21 '21 at 03:33
Yep, point taken. While I did try to split it up into multiple lines for readability so no line is too long, in hindsight using list comprehensions and walrus operators to solve the problem as quickly as possible - might *not* have been the clearest solution overall. — rv.kvetch, Sep 21 '21 at 03:56
@ted I did update my answer to provide an alternate solution that works for <3.8 with hopefully some improved comments about what's going on — rv.kvetch, Sep 21 '21 at 04:10
@rv.kvetch, when executing your code I'm getting the following: 'TypeError: string indices must be integers' for line --> year = d['recordedDate'][:4] ...in this case, d is returning just the node names id, outter, inner, topic, date1, finalDate ... what could be missing? — Rolando F, Sep 21 '21 at 05:04
The most obvious thing I can think of is that `d` (or actually your whole object) is just a json string. Likely you will need to pass it in to `json.loads` to see if that works. — rv.kvetch, Sep 21 '21 at 05:17
To debug, you can insert a breakpoint or print `type(d)` to see what type you're working with. I suspect it's a string. — rv.kvetch, Sep 21 '21 at 05:18
Alternatively, if you use the `data_list` from the first part, it should work. — rv.kvetch, Sep 21 '21 at 05:34
@rv.kvetch, yes, it was a json string. I used json.load as you recommended and works as expected. Thanks!!! — Rolando F, Sep 21 '21 at 05:37
awesome, I'm glad. by the way, if this helped, please consider accepting it as an answer. — rv.kvetch, Sep 23 '21 at 23:14
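
The `TypeError` discussed in the comments above tends to appear when the records are still text rather than parsed objects. Below is a minimal sketch of parsing them first, assuming one JSON object per line (as the `.ndjson` extension in the question suggests). The `ndjson_text` string is a hypothetical stand-in for the file contents, trimmed to a few fields.

```python
import json

# Hypothetical stand-in for the contents of an NDJSON file:
# one complete JSON object per line.
ndjson_text = """\
{"id": "123455", "outter": {"inner": [{"code": "1329"}]}, "finalDate": "1990-07-28T03:52:44-04:00"}
{"id": "123435", "outter": {"inner": [{"code": "9351"}]}, "finalDate": "1995-07-28T03:52:44-04:00"}
"""

# Parse every non-empty line into a Python dict
data_list = [json.loads(line)
             for line in ndjson_text.splitlines()
             if line.strip()]

for d in data_list:
    # Each element is now a dict, so string keys work as expected
    print(type(d).__name__, d['id'], d['finalDate'][:4])
```

With a real file, the same comprehension could iterate over the open file object instead of `ndjson_text.splitlines()`.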

GP2 · Answer 2 · 2021-09-21T03:42:55.750

Try something like this. Change the filters as needed :)

for item in data:
    if item['finalDate'].startswith("1995"):
        for inner in item['outter']['inner']:
            if inner['code'] == '9351':
                print(item)

This will only work for finalDate with the year `1995` :-) – rv.kvetch Sep 21 '21 at 03:28
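
One possible way to address that limitation is to compare the year numerically instead of using `startswith`. This is a sketch rather than the answer author's code: the sample records are trimmed, the third record (id `999999`) is hypothetical and added only to show a later year matching, and `int()` will raise `ValueError` if `finalDate` does not begin with four digits.

```python
# Trimmed sample; the last record is hypothetical
data = [
    {"id": "123455", "outter": {"inner": [{"code": "1329"}]},
     "finalDate": "1990-07-28T03:52:44-04:00"},
    {"id": "123435", "outter": {"inner": [{"code": "9351"}]},
     "finalDate": "1995-07-28T03:52:44-04:00"},
    {"id": "999999", "outter": {"inner": [{"code": "9351"}]},
     "finalDate": "2001-01-15T10:00:00-04:00"},
]

matches = []
for item in data:
    # Convert the first four characters to an int so any year
    # from 1995 onward passes, not only 1995 itself
    if int(item['finalDate'][:4]) >= 1995:
        # any() stops at the first matching inner entry, so an item
        # with several matching entries is added only once
        if any(inner['code'] == '9351' for inner in item['outter']['inner']):
            matches.append(item)

print([item['id'] for item in matches])
```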
