
I am curious how I can use pandas to read nested JSON of the following structure:

{
    "number": "",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations": [
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        },
        {
            "depTimeDiffMin": "0",
            "name": "Windischgarsten Bahnhof",
            "arrTime": "06:37",
            "depTime": "06:40",
            "platform": "2",
            "stationIdx": "1",
            "arrTimeDiffMin": "1",
            "track": ""
        },
        {
            "depTimeDiffMin": "",
            "name": "Linz/Donau Hbf",
            "arrTime": "08:24",
            "depTime": "",
            "platform": "1A-B",
            "stationIdx": "22",
            "arrTimeDiffMin": "1",
            "track": ""
        }
    ]
}

The following keeps the array as JSON; I would prefer it to be expanded into columns:

pd.read_json("/myJson.json", orient='records')

Edit:

Thanks for the first answers. I should refine my question: flattening the nested attributes in the array is not mandatory. It would be ok to just concatenate the df.locations['name'] values, e.g. into [A, B, C].

My file contains multiple JSON objects (one per line). I would like to keep the number, date, name, and locations columns. However, I would need to join the locations.

allLocations = ""
isFirst = True
for location in result.locations:
    if isFirst:
        isFirst = False
        allLocations = location['name']
    else:
        allLocations += "; " + location['name']
allLocations

My approach here does not seem efficient or pandas-style.
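The loop above can be collapsed into a single str.join call (a sketch; the list of dicts below stands in for the parsed result.locations from the question):

```python
# Parsed "locations" array from one JSON record (assumed structure).
locations = [
    {"name": "Spital am Pyhrn Bahnhof"},
    {"name": "Windischgarsten Bahnhof"},
    {"name": "Linz/Donau Hbf"},
]

# str.join replaces the manual first-element bookkeeping.
all_locations = "; ".join(loc["name"] for loc in locations)
print(all_locations)
```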

Georg Heiler

4 Answers


You can use json_normalize:

import json
import pandas as pd

with open('myJson.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'],
                       record_prefix='locations_')
print(df)
  locations_arrTime locations_arrTimeDiffMin locations_depTime  \
0                                                        06:32   
1             06:37                        1             06:40   
2             08:24                        1                     

  locations_depTimeDiffMin           locations_name locations_platform  \
0                        0  Spital am Pyhrn Bahnhof                  2   
1                        0  Windischgarsten Bahnhof                  2   
2                                    Linz/Donau Hbf               1A-B   

  locations_stationIdx locations_track number    name        date  
0                    0          R 3932         R 3932  01.10.2016  
1                    1                         R 3932  01.10.2016  
2                   22                         R 3932  01.10.2016 

EDIT:

You can use read_json, then replace the locations column with just the name of each location via the DataFrame constructor, and finally groupby with apply join:

df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date','name','number'])['locations'].apply(','.join).reset_index()
print(df)
        date    name number                                          locations
0 2016-01-10  R 3932         Spital am Pyhrn Bahnhof,Windischgarsten Bahnho... 
jezrael
  • and json would be the raw file? or file path? – Georg Heiler Nov 14 '16 at 12:43
  • in docs it is `Unserialized JSON objects`, but I tested it with a dict. – jezrael Nov 14 '16 at 12:44
  • I see. Thanks. And which columns need to go into the ['date', 'number', 'name'] array? That is still a bit unclear for me. For me, it would be enough to str concat the names of all the locations and keep that column. – Georg Heiler Nov 14 '16 at 12:58
  • I think it is the first level of data which is not nested in `[]`. In `locations` there is nested data `{}`, so it is the second parameter. – jezrael Nov 14 '16 at 13:06
  • @jezrael thanks for the solution. I am working on a similar problem and your solution actually worked. But, it removes the rows with missing values for column 'locations' in this example. Any suggestion on how to read data without removing the rows for any missing values of the locations column? – vishnu prashanth Jul 11 '18 at 13:33
  • @vishnuprashanth - Hard question, do you think `None` or `np.nan` values? It removed `json_normalize` function ? – jezrael Jul 11 '18 at 13:35
  • @jezrael I have a few rows that contain the values 'date': 'x', 'name': 'y', 'locations': [] and other rows containing values for 'locations'. When I applied json_normalize, it returned only the rows that contain values for locations and removed the rows that have 'locations': []. I think it is none. [] – vishnu prashanth Jul 11 '18 at 13:42
  • @vishnuprashanth - Then need change `json_normalize` to `pd.DataFrame` constructor. – jezrael Jul 11 '18 at 13:43
  • @jezrael you mean I should try using df.locations = pd.DataFrame(df.locations.values.tolist())['name'] ? – vishnu prashanth Jul 11 '18 at 13:46
  • First try `pd.DataFrame(df.locations.values.tolist())` – jezrael Jul 11 '18 at 13:48
  • I tried that and I got none values for empty 'locations' and the dict for the values present in 'locations' – vishnu prashanth Jul 11 '18 at 13:52
  • @vishnuprashanth - hmmm, then I have no idea :( Maybe the best create new question in SO with sample data – jezrael Jul 11 '18 at 13:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174813/discussion-between-vishnu-prashanth-and-jezrael). – vishnu prashanth Jul 11 '18 at 13:54
  • @jezrael new question posted https://stackoverflow.com/questions/51287654/reading-nested-json-with-none-values-in-python3 – vishnu prashanth Jul 11 '18 at 14:08
  • @jezrael this is excellent! I need to know how to do this method for an array within an array as well - similar to an $unwind in MongoDB. Is it possible? – Matt Lightbourn Dec 02 '21 at 04:50
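For the array-within-an-array case raised in the last comment, json_normalize accepts a list of path components as record_path, which unwinds the inner arrays row by row, similar to $unwind (a sketch on hypothetical data; the "stops" key is an assumption, not from the question):

```python
import pandas as pd

data = {
    "name": "R 3932",
    "locations": [
        {"name": "Spital am Pyhrn Bahnhof",
         "stops": [{"platform": "2"}, {"platform": "3"}]},
    ],
}

# Each path component descends one level; every element of the
# inner "stops" arrays becomes its own row. meta pulls the
# top-level "name" onto each row.
df = pd.json_normalize(data, record_path=["locations", "stops"],
                       meta=["name"])
print(df)
```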

Another option, if anyone finds this while working through a notebook: read the file in as a df with

import pandas as pd

df = pd.read_json('filename.json')
df2 = pd.DataFrame.from_records(df['nest_level_1']['nest_level_2'])

Happy coding
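Applied to the structure from the question, the same from_records idea looks roughly like this (a sketch; the JSON is inlined and trimmed instead of read from a file):

```python
import pandas as pd

# Trimmed version of the record from the question.
data = {
    "number": "",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations": [
        {"name": "Spital am Pyhrn Bahnhof", "depTime": "06:32"},
        {"name": "Windischgarsten Bahnhof", "depTime": "06:40"},
    ],
}

# from_records expands the list of dicts into one column per key.
df2 = pd.DataFrame.from_records(data["locations"])
print(df2)
```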

chromebookdev

A possible alternative to pandas.json_normalize is to build your own dataframe by extracting only the selected keys and values from the nested dictionary. The main reason for doing this is that json_normalize gets slow for very large JSON files (and might not always produce the output you want).

So, here is an alternative way to flatten the nested dictionary in pandas using glom. The aim is to extract selected keys and values from the nested dictionary and save them in a separate column of the pandas dataframe.

Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5

import pandas as pd
from glom import glom
from ast import literal_eval


target = {
    "number": "",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations":
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        }
}   



# Import data
df = pd.DataFrame([str(target)], columns=['target'])

# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'locations.name'))
iEriii

I have a multiline JSON file with one JSON object per line, NOT comma separated; each line is independent:

{'a':'b','scope':{'eid':123213}}
{'a':'d','scope':{'eid':1343213}}

I used the following logic to read the nested structure:

import pandas as pd

threshold = pd.read_json(r"/content/data.json", lines=True)
threshold['entityId'] = pd.DataFrame.from_records(threshold['scope'])['eid']
threshold.head()
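An equivalent without the intermediate from_records step is pd.json_normalize, which flattens the nested scope dict into dotted column names (a sketch on the two sample records above, inlined as a list of dicts instead of read from a file):

```python
import pandas as pd

# Two records as they would appear, one per line, in the file.
records = [
    {"a": "b", "scope": {"eid": 123213}},
    {"a": "d", "scope": {"eid": 1343213}},
]

# Nested keys become dotted column names, e.g. "scope.eid".
flat = pd.json_normalize(records)
print(flat.columns.tolist())  # ['a', 'scope.eid']
```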