Extracting information from a JSON file where the field is in different places in various dicts

Question

I'm extracting data from a nested JSON file in Python 3.8 that contains many dicts, and I'm getting the following KeyError:

extended_tweet = data[str(i)]['extended_tweet']['full_text']
KeyError: 'extended_tweet'

How can I search a nested JSON file for a field that sits in a different structure in each dict? I think my inflexible way of defining the fields is preventing the right output, but I can't figure out how to fix it.

for i in data:
    date = data[str(i)]['created_at']
    account = data[str(i)]['user']['name']
    location = data[str(i)]['user']['location']
    truncated = data[str(i)]['truncated']
    tweet = data[str(i)]['text']
    extended_tweet = data[str(i)]['extended_tweet']['full_text']
    retweeted_status = data[str(i)]['retweeted_status']['extended_tweet']['full_text']
    if truncated == 'True':
        print(truncated, date, account, location, extended_tweet)
    elif 'RT' in tweet:
        print(truncated, date, account, location, retweeted_status)
    else:
        print(truncated, date, account, location, tweet)

Here is an example of one dict in my JSON file. The key "3" identifies the dict, and I need to get the data from the field extended_tweet.full_text. Every path finder displays the path x.extended_tweet.full_text, but when I use that path I get the error shown above.

"3": {
  "created_at": "time",
  "id": id,
  "id_str": "id",
  "text": "text",
  "display_text_range": [
   0,
   140
  ],
  "source": "",
  "truncated": true,
  "in_reply_to_status_id": null,
  "in_reply_to_status_id_str": null,
  "in_reply_to_user_id": null,
  "in_reply_to_user_id_str": null,
  "in_reply_to_screen_name": null,
  "user": {
   "id": ,
   "id_str": "",
   "name": "",
   "screen_name": "name",
   "location": "location",
   "url": "url",
   "description": "description",
   "translator_type": "none",
   "derived": {
    "locations": [
     {
      "country": "country",
      "country_code": "land",
      "locality": "locality",
      "region": "region",
      "full_name": "full_name",
      "geo": {
       "coordinates": [
        number,
        number
       ],
       "type": "point"
      }
     }
    ]
   },
   "protected": false,
   "verified": true,
   "followers_count": number,
   "friends_count": number,
   "listed_count": number,
   "favourites_count": number,
   "statuses_count": number,
   "created_at": "time",
   "utc_offset": null,
   "time_zone": null,
   "geo_enabled": false,
   "lang": null,
   "contributors_enabled": false,
   "is_translator": false,
   "profile_background_color": "number",
   "profile_background_image_url": "gif",
   "profile_background_image_url_https": "link",
   "profile_background_tile": true,
   "profile_link_color": "607696",
   "profile_sidebar_border_color": "FFFFFF",
   "profile_sidebar_fill_color": "EFEFEF",
   "profile_text_color": "333333",
   "profile_use_background_image": true,
   "profile_image_url": "link",
   "profile_image_url_https": "link",
   "profile_banner_url": "bannerurl",
   "default_profile": false,
   "default_profile_image": false,
   "following": null,
   "follow_request_sent": null,
   "notifications": null
  },
  "geo": null,
  "coordinates": null,
  "place": null,
  "contributors": null,
  "is_quote_status": false,
  "extended_tweet": {
   "full_text": "full_text",

KeyError is telling you there is no ```extended_tweet``` field in the tweet, so you need to handle fields that may not exist. Stack Overflow (SO) has answers to most coding problems, so searching SO is recommended. Here is an example that explains how to solve your problem: https://stackoverflow.com/questions/10116518/im-getting-key-error-in-python — pink spikyhairman, May 02 '20 at 08:34
Thank you. I added an example to explain my situation better. I tried all the path finder tests and wasn't successful. Thank you for the link. I checked the examples, but they didn't really fit my case, so I included an example of one of my dicts. Maybe you can help me out? — tester, May 02 '20 at 19:30

pink spikyhairman · Accepted Answer · 2020-05-03T18:14:11.810

Hi tester :) I put your JSON example in a file, put some values in various fields and added a retweeted_status object, then basically ran your code like this:

import json
import os

with open( os.path.join(os.path.realpath('.'), 'src/test/x.json') ) as file1:
    data = json.load(file1)

for i in data:
    date = data[str(i)]['created_at']
    account = data[str(i)]['user']['name']
    location = data[str(i)]['user']['location']
    truncated = data[str(i)]['truncated']
    tweet = data[str(i)]['text']
    extended_tweet = data[str(i)]['extended_tweet']['full_text']
    retweeted_status = data[str(i)]['retweeted_status']['extended_tweet']['full_text']
    if truncated == 'True':
        print(truncated, date, account, location, extended_tweet)
    elif 'RT' in tweet:
        print(truncated, date, account, location, retweeted_status)
    else:
        print(truncated, date, account, location, tweet)

Works fine for me and prints:

True time  location text
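A possible reason the line above ends with text rather than full_text, offered as a reading of the output and not something confirmed against the full data: json.load turns the JSON literal true into the Python bool True, so the test truncated == 'True' compares a bool with a string and is never satisfied. The small check below is a sketch that illustrates this; the names is_bool and matches_string are made up for the example.

```python
import json

# A cut-down tweet with only the two fields that matter for the comparison
tweet = json.loads('{"truncated": true, "text": "text"}')
truncated = tweet['truncated']

is_bool = isinstance(truncated, bool)   # JSON true is loaded as the bool True
matches_string = (truncated == 'True')  # a bool is never equal to the string 'True'

print(is_bool, matches_string, bool(truncated))
```

If that is what happens, testing the value directly (if truncated:) would take the extended_tweet branch for truncated tweets.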

Here is the JSON I put in a file:

{"3": {
    "created_at": "time",
    "id": 1234,
    "id_str": "id",
    "text": "text",
    "display_text_range": [
     0,
     140
    ],
    "source": "",
    "truncated": true,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
     "id": 1234,
     "id_str": "",
     "name": "",
     "screen_name": "name",
     "location": "location",
     "url": "url",
     "description": "description",
     "translator_type": "none",
     "derived": {
      "locations": [
       {
        "country": "country",
        "country_code": "land",
        "locality": "locality",
        "region": "region",
        "full_name": "full_name",
        "geo": {
         "coordinates": [
          100,
          100
         ],
         "type": "point"
        }
       }
      ]
     },
     "protected": false,
     "verified": true,
     "followers_count": 100,
     "friends_count": 100,
     "listed_count": 100,
     "favourites_count": 100,
     "statuses_count": 100,
     "created_at": "time",
     "utc_offset": null,
     "time_zone": null,
     "geo_enabled": false,
     "lang": null,
     "contributors_enabled": false,
     "is_translator": false,
     "profile_background_color": "number",
     "profile_background_image_url": "gif",
     "profile_background_image_url_https": "link",
     "profile_background_tile": true,
     "profile_link_color": "607696",
     "profile_sidebar_border_color": "FFFFFF",
     "profile_sidebar_fill_color": "EFEFEF",
     "profile_text_color": "333333",
     "profile_use_background_image": true,
     "profile_image_url": "link",
     "profile_image_url_https": "link",
     "profile_banner_url": "bannerurl",
     "default_profile": false,
     "default_profile_image": false,
     "following": null,
     "follow_request_sent": null,
     "notifications": null
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "extended_tweet": {
     "full_text": "full_text"
    },
    "retweeted_status": {
        "extended_tweet": {
            "full_text": "full_text"
        }
       }
   }}

Looking at the full data, it's clear that some elements do not exist in every tweet. The way to handle missing keys without using exceptions is the dict get method, which returns a default value if the key is missing. Here is code that handles missing elements in extended and retweeted tweets without raising exceptions and stores a placeholder string naming what is missing. This code processes all 499 tweets in your data.

# Runs inside the loop, where i is the current key of data
full_tweet = data[str(i)]
# get() returns the second argument instead of raising KeyError
extended_tweet = full_tweet.get('extended_tweet', 'extended_tweet missing')
if extended_tweet != 'extended_tweet missing':
    extended_tweet = extended_tweet.get('full_text', 'full_text missing')
retweeted_status = full_tweet.get('retweeted_status', 'retweeted_status missing')
if retweeted_status != 'retweeted_status missing':
    retweeted_status = retweeted_status.get('extended_tweet', 'extended_tweet missing')
    if retweeted_status != 'extended_tweet missing':
        # Assumes full_text is always present inside a retweet's extended_tweet
        retweeted_status = retweeted_status['full_text']
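The same idea can be wrapped in a small helper so the nested lookups don't have to be spelled out each time. This is only a sketch under a few assumptions: dig and pick_text are hypothetical names, the four sample tweets are invented for the example, and truncated is tested as a boolean because json.load turns JSON true into Python's True.

```python
import json

def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

def pick_text(tweet):
    """Choose full_text where it exists, otherwise fall back to the short text."""
    text = tweet.get('text', '')
    if tweet.get('truncated'):  # bool from JSON, not the string 'True'
        return dig(tweet, 'extended_tweet', 'full_text', default=text)
    if 'RT' in text:
        return dig(tweet, 'retweeted_status', 'extended_tweet', 'full_text',
                   default=text)
    return text

# Invented sample data covering the shapes a tweet might take
sample = json.loads('''
{
  "1": {"text": "short", "truncated": false},
  "2": {"text": "long...", "truncated": true,
        "extended_tweet": {"full_text": "long full text"}},
  "3": {"text": "RT @a: cut...", "truncated": false,
        "retweeted_status": {"extended_tweet": {"full_text": "retweeted full text"}}},
  "4": {"text": "RT @b: short retweet", "truncated": false,
        "retweeted_status": {"text": "short retweet"}}
}
''')

results = {key: pick_text(tweet) for key, tweet in sample.items()}
for key, text in results.items():
    print(key, text)
```

Tweet "4" has a retweeted_status without an extended_tweet, so the lookup falls back to the plain text instead of raising a KeyError.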

Thanks for your research, pink spikyhairman :). The "text" output works for me as well, but I need the "full_text" output from the JSON file, and that's what keeps me busy. — tester, May 03 '20 at 10:24
If I change the ```text``` field in the JSON to ```RT``` it gives the ```full_text``` field as expected. Maybe I misunderstand what you want? — pink spikyhairman, May 03 '20 at 10:53
Thank you. I'm pretty sure you understand my problem very well, but the code still doesn't work. Maybe there is a problem somewhere else that is specific to my JSON file. I know you have already put a lot of time into my question, and I'm very thankful. Could you take a look at my original JSON file: https://codeshare.io/Gq0djA — tester, May 03 '20 at 16:18
