Extracting Key from multilevel (scraped) complex structure json file in python

Question

I have a multilevel, complex JSON file, twitter.json, and I want to extract ONLY the author IDs from it.

This is how my file 'twitter.json' looks:

[
[
    {
        "tweets_results": [
            {
                "meta": {
                    "result_count": 0
                }
            }
        ],
        "youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
    }
],
[
    {
        "tweets_results": [
            {
                "data": [
                    {
                        "author_id": "125959599",
                        "created_at": "2021-06-12T15:16:40.000Z",
                        "id": "1403732993269649410",
                        "in_reply_to_user_id": "125959599",
                        "lang": "pt",
                        "public_metrics": {
                            "like_count": 0,
                            "quote_count": 0,
                            "reply_count": 1,
                            "retweet_count": 0
                        },
                        "source": "Twitter for Android",
                        "text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
                    },
                    {
                        "author_id": "521827796",
                        "created_at": "2021-06-07T20:23:08.000Z",
                        "id": "1401998177943834626",
                        "in_reply_to_user_id": "623794755",
                        "lang": "und",
                        "public_metrics": {
                            "like_count": 0,
                            "quote_count": 0,
                            "reply_count": 0,
                            "retweet_count": 0
                        },
                        "source": "TweetDeck",
                        "text": "@thelittlecouto"
                    }
                ],
                "meta": {
                    "newest_id": "1426546114115870722",
                    "oldest_id": "1367808835403063298",
                    "result_count": 7
                }
            }
        ],
        "youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
    }
]
]

I have read through many similar SO questions, but the structures of those JSON files are pretty simple, and when I try to replicate their approaches, I hit errors.

From what I read, the reference would go something like contents.tweets_results.data.author_id, and I am loading the file using contents = json.load(open("twitter.json")). Any help is appreciated.

EDIT: Both @sammywemmy's and @balderman's code worked for me. I accepted @sammywemmy's because I used that code, but I wanted to credit them both in some way.

score 1 · Accepted Answer · answered Sep 17 '21 at 07:46

Your data has a path to it. You've got a list nested in a list; within the inner list, each dict has a tweets_results key, whose value is a list of dicts; one of them has a data key, which contains a list of dictionaries, where one of the keys is author_id. We can simulate the path (sort of) as: '[][].tweets_results[].data[].author_id'

To rehash: hit the outer list, then the inner list, then access the tweets_results key and the list it holds; within that list, access the data key; and within the list associated with data, access author_id.

With this path, one can use jmespath to pull out the author_ids:

# pip install jmespath
import jmespath

# data is the parsed JSON, e.g. the result of json.load(f)
# compile the path once, similar to re.compile
expression = jmespath.compile('[][].tweets_results[].data[].author_id')
expression.search(data)
['125959599', '521827796']
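
For comparison, the same path can be walked in plain Python, with one `for` clause per step of the path expression. This is only a sketch: `data` below is a trimmed stand-in for the structure in the question, and using `.get("data", [])` to skip results without a data key is an assumption about how missing keys should be handled.

```python
# Trimmed stand-in for the question's structure (illustrative only).
data = [
    [{"tweets_results": [{"meta": {"result_count": 0}}]}],
    [{"tweets_results": [{"data": [{"author_id": "125959599"},
                                   {"author_id": "521827796"}],
                          "meta": {"result_count": 7}}]}],
]

# Each `for` clause mirrors one step of '[][].tweets_results[].data[].author_id'.
author_ids = [
    tweet["author_id"]
    for outer in data                      # outer list
    for entry in outer                     # inner list of dicts
    for result in entry["tweets_results"]  # list stored under tweets_results
    for tweet in result.get("data", [])    # "data" is absent when there are no results
    if "author_id" in tweet
]
print(author_ids)  # ['125959599', '521827796']
```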

jmespath is quite useful if you want to build a data structure from nested dicts; if, however, you are only concerned with the values for author_id, you can use nested_lookup instead. It recursively searches for the key and returns the matching values:

# pip install nested-lookup
from nested_lookup import nested_lookup
nested_lookup('author_id', data)
['125959599', '521827796']
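
To illustrate what a recursive key search means here, below is a minimal standard-library sketch. It is not the nested_lookup implementation; `find_key` is a hypothetical helper name, and `sample` is a trimmed stand-in for the question's data.

```python
def find_key(obj, key):
    """Yield every value stored under `key`, at any depth of dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep descending: the value itself may contain more matches.
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)
    # Strings, numbers, and None are leaves, so nothing is yielded for them.


# Trimmed stand-in for the question's structure (illustrative only).
sample = [
    [{"tweets_results": [{"meta": {"result_count": 0}}]}],
    [{"tweets_results": [{"data": [{"author_id": "125959599"},
                                   {"author_id": "521827796"}]}]}],
]
print(list(find_key(sample, "author_id")))  # ['125959599', '521827796']
```

Unlike the path-based approach, this does not care where in the structure the key sits, so it would also pick up an author_id key that appears at some other nesting level.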

A small follow-up. This is how I have loaded my JSON: ```with open("twitter.json", 'r', encoding="utf8") as f: contents = json.load(f)``` Should I just call ```nested_lookup('author_id', contents)```? - Nilima, Sep 17 '21 at 07:53
How do I verify that? For me, ```contents``` seems to be a list. (You described it in more detail as a dictionary inside a list inside a list.) - Nilima, Sep 17 '21 at 07:58
sorry, I meant list. It should match the data structure you shared. Did you encounter any issues when you ran the nested_lookup? — sammywemmy, Sep 17 '21 at 07:59

score 1 · Answer 2 · answered Sep 17 '21 at 07:58

See below (no external lib is involved)

data = [
[
    {
        "tweets_results": [
            {
                "meta": {
                    "result_count": 0
                }
            }
        ],
        "youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
    }
],
[
    {
        "tweets_results": [
            {
                "data": [
                    {
                        "author_id": "125959599",
                        "created_at": "2021-06-12T15:16:40.000Z",
                        "id": "1403732993269649410",
                        "in_reply_to_user_id": "125959599",
                        "lang": "pt",
                        "public_metrics": {
                            "like_count": 0,
                            "quote_count": 0,
                            "reply_count": 1,
                            "retweet_count": 0
                        },
                        "source": "Twitter for Android",
                        "text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
                    },
                    {
                        "author_id": "521827796",
                        "created_at": "2021-06-07T20:23:08.000Z",
                        "id": "1401998177943834626",
                        "in_reply_to_user_id": "623794755",
                        "lang": "und",
                        "public_metrics": {
                            "like_count": 0,
                            "quote_count": 0,
                            "reply_count": 0,
                            "retweet_count": 0
                        },
                        "source": "TweetDeck",
                        "text": "@thelittlecouto"
                    }
                ],
                "meta": {
                    "newest_id": "1426546114115870722",
                    "oldest_id": "1367808835403063298",
                    "result_count": 7
                }
            }
        ],
        "youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
    }
]
]

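ids = []
for entry in data:        # outer list
    for sub in entry:     # inner list of dicts
        result = sub['tweets_results']
        # only the first result is inspected; it has no 'data' key when nothing was found
        if result[0].get('data'):
            info = result[0]['data']
            for item in info:
                ids.append(item.get('author_id', 'not_found'))
print(ids)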

output

['125959599', '521827796']

Question - I have a file, not a string. So when you do ```data = [...]```, you take them in as a string. I am doing a ```content = json.load(file)``` — Nilima, Sep 17 '21 at 08:00
@Nilima - what you are doing is correct and will lead to the desired output. Give it a try. Just do `data = json.load(file)` — balderman, Sep 17 '21 at 08:01
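
As the comments above point out, json.load already returns nested lists and dicts rather than a string, so the same loop works on a file. A rough sketch follows, assuming the file has the structure shown in the question; `author_ids_from_file` is a hypothetical helper name, and iterating over every entry of tweets_results (instead of only the first) is an assumption.

```python
import json


def author_ids_from_file(path):
    """Load a JSON file and collect its author_id values (hypothetical helper)."""
    with open(path, "r", encoding="utf8") as f:
        data = json.load(f)  # nested lists and dicts, not a string

    ids = []
    for entry in data:                                # outer list
        for sub in entry:                             # inner list of dicts
            for result in sub.get("tweets_results", []):
                for item in result.get("data", []):   # skipped when "data" is absent
                    ids.append(item.get("author_id", "not_found"))
    return ids
```

Calling `author_ids_from_file("twitter.json")` on the file from the question should then give the same two IDs as the in-memory version.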
