0

I have a dataframe with 3 columns (INSTNR, Enhedsadresser, API_response), where the 3rd column (API_response) contains JSON objects. I would like to flatten the JSON object and store the extracted information in separate columns within the same df. I am particularly interested in extracting kategori, resultater -> adresse -> id, and resultater -> adresse -> adgangsadresseid information.

I have tried:

data = json_normalize(data=df['API_response'], record_path='resultater',
                            meta=['kategori'], errors='ignore')

but it simply returns TypeError: string indices must be integers

Whereas data = json_normalize(data=df['API_response']) gave me a column with a list of indices...

How can I extract the needed information?

Example of a JSON object:

{
  "kategori": "A",
  "resultater": [
    {
      "adresse": {
        "id": "0a3f50bc-f815-32b8-e044-0003ba298018",
        "vejnavn": "Staldgaardsgade",
        "adresseringsvejnavn": "Staldgaardsgade",
        "husnr": "39A",
        "supplerendebynavn": null,
        "postnr": "7100",
        "postnrnavn": "Vejle",
        "status": 1,
        "virkningstart": "2009-11-24T02:15:25.000Z",
        "virkningslut": null,
        "adgangsadresseid": "0a3f5090-edef-32b8-e044-0003ba298018",
        "etage": "st",
        "dør": "th",
        "href": "https://api.dataforsyningen.dk/adresser/0a3f50bc-f815-32b8-e044-0003ba298018"
      },
      "aktueladresse": {
        "id": "0a3f50bc-f815-32b8-e044-0003ba298018",
        "vejnavn": "Staldgaardsgade",
        "adresseringsvejnavn": "Staldgaardsgade",
        "husnr": "39A",
        "supplerendebynavn": null,
        "postnr": "7100",
        "postnrnavn": "Vejle",
        "status": 1,
        "virkningstart": "2009-11-24T02:15:25.000Z",
        "virkningslut": null,
        "adgangsadresseid": "0a3f5090-edef-32b8-e044-0003ba298018",
        "etage": "st",
        "dør": "th",
        "href": "https://api.dataforsyningen.dk/adresser/0a3f50bc-f815-32b8-e044-0003ba298018"
      },
      "vaskeresultat": {
        "variant": {
          "vejnavn": "Staldgaardsgade",
          "husnr": "39A",
          "etage": "st",
          "dør": "th",
          "supplerendebynavn": null,
          "postnr": "7100",
          "postnrnavn": "Vejle"
        },
        "afstand": 0,
        "forskelle": {
          "vejnavn": 0,
          "husnr": 0,
          "postnr": 0,
          "postnrnavn": 0,
          "etage": 0,
          "dør": 0
        },
        "parsetadresse": {
          "vejnavn": "Staldgaardsgade",
          "husnr": "39A",
          "etage": "st",
          "dør": "th",
          "postnr": "7100",
          "postnrnavn": "Vejle"
        },
        "ukendtetokens": [],
        "anvendtstormodtagerpostnummer": null
      }
    }
  ]
}

Link to API response containing this JSON object: https://api.dataforsyningen.dk/datavask/adresser?betegnelse=Staldgaardsgade%2039A%20st%20th,%207100%20Vejle

EDIT 1

I created GitHub repo with data and python script: https://github.com/mantasbacys/TREFOR

1 Answers1

0

Series.str.get can help you extract individual elements from a dictionary column.

If API_response is a column of dictionaries formatted as your example, you can use:

df['kategori'] = df['api_response'].str.get('kategori')
df['id'] = (
    df['api_response']
    .str.get('resultater')  # get resultater object
    .str.get(0)  # get first element in list
    .str.get('adresse')
    .str.get('id')
)
df['adgangsadresseid'] =  = (
    df['api_response']
    .str.get('resultater')  # get resultater object
    .str.get(0)  # get first element in list
    .str.get('adresse')
    .str.get('adgangsadresseid')
)

Note that this will fail if the format doesn't match, so double check! E.g. if there are multiple elements in the resultater list, you may want to use Series.explode instead.


EDIT:

Your file shows that API_response column actually contains parsed JSON dictionaries saved as strings. This means .str.get methods do not work as the values are not valid dictionaries.

As a quick fix, you should convert the dictionary strings to actual dictionaries first:

def convert(s): 
    """Convert dictionary string to JSON-like format and parse with json module"""
    s = s.replace("'", '"')
    s = s.replace("None", "null")
    return json.loads(s)

df['api_response'] = df.API_response.apply(convert)
# Alternative - UNSAFE! (see https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice)
# df['api_response'] = df.API_response.apply(eval)

I am assuming this is happening because you first download the data, then you parse it into a dictionary, and save in a file before you load it again. This conversion to a string column happens as you save the file.

To avoid this conversion to string and back, try parsing the JSON and extracting values of interest in a single script.

ozacha
  • 1,212
  • 9
  • 14
  • I get this when I run the code: `AttributeError: Can only use .str accessor with string values!` Isnt it because `.str` method cannot be used with dictionary? All cells in `API_response` column are formatted in the same way as my example. – mantasbacys Jan 20 '22 at 06:46
  • Which line is throwing this error? `.str` can be used on different `object` columns, assuming the format is correct – ozacha Jan 21 '22 at 08:15
  • I get it on line 4: `.str.get('resultater')` – mantasbacys Jan 24 '22 at 07:12
  • See edit; also - alternatively, you can use `.str[item]` instead of `.str.get(item)` – ozacha Jan 24 '22 at 12:26
  • 1
    Great, it worked like a charm after conversion to dict when I changed your code line to `df['API_response'].str.get('resultater').str[0].str.get('adresse').str.get('id')`! Thanks for taking time to help! – mantasbacys Jan 25 '22 at 11:22