
I have seen many JSON-reading questions on Stack Overflow that use pandas, but I still could not manage to solve this seemingly simple problem.

Data

{"session_id":{"0":["X061RFWB06K9V"],"1":["5AZ2X2A9BHH5U"]},"unix_timestamp":{"0":[1442503708],"1":[1441353991]},"cities":{"0":["New York NY, Newark NJ"],"1":["New York NY, Jersey City NJ, Philadelphia PA"]},"user":{"0":[[{"user_id":2024,"joining_date":"2015-03-22","country":"UK"}]],"1":[[{"user_id":2853,"joining_date":"2015-03-28","country":"DE"}]]}}

My attempt

import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize

# attempt1
df = pd.read_json('a.json')

# attempt2
with open('a.json') as fi:
    data = json.load(fi)
    df = json_normalize(data,record_path='user',meta=['session_id','unix_timestamp','cities'])

Neither of them gives me the required output.

Required output

      session_id unix_timestamp       cities  user_id joining_date country 
0  X061RFWB06K9V     1442503708  New York NY     2024   2015-03-22      UK   
0  X061RFWB06K9V     1442503708    Newark NJ     2024   2015-03-22      UK 

Preferred method

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html

I would love to see an implementation that uses pd.io.json.json_normalize:

pandas.io.json.json_normalize(data: Union[Dict, List[Dict]], record_path: Union[str, List, NoneType] = None, meta: Union[str, List, NoneType] = None, meta_prefix: Union[str, NoneType] = None, record_prefix: Union[str, NoneType] = None, errors: Union[str, NoneType] = 'raise', sep: str = '.', max_level: Union[int, NoneType] = None)
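Something along these lines is roughly what I have in mind; reshaping the column-oriented JSON into one record per session is my own pre-processing (there may well be a cleaner way), and the final explode (pandas >= 0.25) just splits the cities into rows:

import json
import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

with open('a.json') as fi:
    data = json.load(fi)

# reshape the column-oriented JSON into one record per session
# (my own pre-processing, not part of json_normalize itself)
records = [
    {
        "session_id": data["session_id"][k][0],
        "unix_timestamp": data["unix_timestamp"][k][0],
        "cities": data["cities"][k][0],
        "user": data["user"][k][0],   # a list holding a single user dict
    }
    for k in data["session_id"]
]

df = json_normalize(records, record_path="user",
                    meta=["session_id", "unix_timestamp", "cities"])

# split the comma-separated cities into one row per city
df = df.assign(cities=df["cities"].str.split(", ")).explode("cities")

The column order differs from the required output, but the content would be the same.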

BhishanPoudel

5 Answers


Here is another way:

df = pd.read_json(r'C:\path\file.json')

# unwrap the single-element lists
final = df.stack().str[0].unstack()
# split the comma-separated cities into one row per city
final = final.assign(cities=final['cities'].str.split(',')).explode('cities')
# build the user columns on the same (duplicated) index so the values stay aligned row by row
final = final.assign(**pd.DataFrame(final.pop('user').str[0].tolist(), index=final.index))
print(final)

      session_id unix_timestamp            cities  user_id joining_date  \
0  X061RFWB06K9V     1442503708       New York NY     2024   2015-03-22   
0  X061RFWB06K9V     1442503708         Newark NJ     2024   2015-03-22   
1  5AZ2X2A9BHH5U     1441353991       New York NY     2853   2015-03-28   
1  5AZ2X2A9BHH5U     1441353991    Jersey City NJ     2853   2015-03-28   
1  5AZ2X2A9BHH5U     1441353991   Philadelphia PA     2853   2015-03-28   

  country  
0      UK  
0      UK  
1      DE  
1      DE  
1      DE  
anky
  • why did you choose `cities` and `user` here? – Jonnyboi Feb 26 '21 at 02:03
  • 1
    @Jonnyboi I dont remember very well since it was over a year ago , but by the looks of it, the read_json returned a list for the same session_id and unix_timestamp which we wanted as rows - hence we exploded it. Then we converted the user (which was a list too but we wanted them as columns) to a dataframe and assigned back. – anky Feb 26 '21 at 03:07

Here's one way to do it:

import pandas as pd

# let's say d is your parsed JSON (a dict)
df = pd.DataFrame.from_dict(d, orient='index').T.reset_index(drop=True)

# unlist each element
df = df.applymap(lambda x: x[0])

# convert user column to multiple cols
df = pd.concat([df.drop('user', axis=1), df['user'].apply(lambda x: x[0]).apply(pd.Series)], axis=1)

      session_id  unix_timestamp  \
0  X061RFWB06K9V      1442503708   
1  5AZ2X2A9BHH5U      1441353991   

                                         cities  user_id joining_date country  
0                        New York NY, Newark NJ     2024   2015-03-22      UK  
1  New York NY, Jersey City NJ, Philadelphia PA     2853   2015-03-28      DE 
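If you also want one row per city, as in the required output, a split followed by explode (pandas >= 0.25) should work on top of this:

# optionally split the comma-separated cities into individual rows
df = df.assign(cities=df['cities'].str.split(', ')).explode('cities').reset_index(drop=True)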
YOLO

I am using explode with join:

# j is the parsed JSON dict (e.g. j = json.loads(...))
s = pd.DataFrame(j).apply(lambda x: x.str[0])
s['cities'] = s.cities.str.split(',')
s = s.explode('cities')
s.reset_index(drop=True, inplace=True)
# flatten the per-row lists of user dicts and join them on as columns
s = s.join(pd.DataFrame(sum(s.user.tolist(), [])))
      session_id  unix_timestamp  ... joining_date country
0  X061RFWB06K9V      1442503708  ...   2015-03-22      UK
1  X061RFWB06K9V      1442503708  ...   2015-03-22      UK
2  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
3  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
4  5AZ2X2A9BHH5U      1441353991  ...   2015-03-28      DE
[5 rows x 7 columns]
BENY

Once you have df, you can then merge the two parts:

df = pd.read_json('a.json').apply(lambda x: x.str[0])   # unwrap the single-element lists
df1 = df.drop('user', axis=1)
df2 = json_normalize(df['user'].str[0].tolist())        # each user cell holds a one-element list of dicts

df = df1.merge(df2,left_index=True,right_index=True)

BhishanPoudel

Just thought I'd share another means of extracting data from nested JSON into pandas, for future visitors to this question. Each of the columns is extracted before reading into pandas; jmespath comes in handy here, as it allows for easy traversal of JSON data:

import jmespath
import pandas as pd
from pprint import pprint

# data is the parsed JSON from the question (e.g. data = json.load(fi))
expression = jmespath.compile('''{session_id:session_id.*[],
                                  unix_timestamp : unix_timestamp.*[],
                                  cities:cities.*[],
                                  user_id : user.*[][].user_id,
                                  joining_date : user.*[][].joining_date,
                                  country : user.*[][].country
                              }''')
res = expression.search(data)
pprint(res)

{'cities': ['New York NY, Newark NJ',
            'New York NY, Jersey City NJ, Philadelphia PA'],
 'country': ['UK', 'DE'],
 'joining_date': ['2015-03-22', '2015-03-28'],
 'session_id': ['X061RFWB06K9V', '5AZ2X2A9BHH5U'],
 'unix_timestamp': [1442503708, 1441353991],
 'user_id': [2024, 2853]}

Read the data into pandas and split the cities into individual rows:

df = (pd.DataFrame(res)
      .assign(cities = lambda x: x.cities.str.split(','))
      .explode('cities')
     )
df

session_id      unix_timestamp  cities       user_id      joining_date  country
0   X061RFWB06K9V   1442503708  New York NY     2024      2015-03-22    UK
0   X061RFWB06K9V   1442503708  Newark NJ       2024      2015-03-22    UK
1   5AZ2X2A9BHH5U   1441353991  New York NY     2853      2015-03-28    DE
1   5AZ2X2A9BHH5U   1441353991  Jersey City NJ  2853      2015-03-28    DE
1   5AZ2X2A9BHH5U   1441353991  Philadelphia PA 2853      2015-03-28    DE
sammywemmy