I have JSON files that each look like a dictionary of lists of similar dictionaries:

{"People":[{"FirstName":"Max","Surname":"Smith"},{"FirstName":"Jane","Surname":"Smart"}],
"Animals":[{"Breed":"Cat","Name":"WhiteSocks"},{"Breed":"Dog","Name":"Zeus"}]}

I'm using the following code to convert this into a dictionary of pandas dataframes:

import pandas as pd
import json

# Read the json file
jsonFile = 'exampleJson.json'
with open(jsonFile) as j:
    data = json.load(j)

# Convert it to a dictionary of dataframes
dfDict = {}
for dfName, dfContents in data.items():
    dfDict[dfName] = pd.DataFrame(dfContents)
    display(dfDict[dfName])  # display() comes from Jupyter/IPython; use print() outside a notebook

The above code gives me exactly what I want, which is a dictionary of dataframes. However, it seems rather inefficient. Is there a way to read the JSON directly into a dictionary of dataframes, rather than reading it into an intermediate Python dictionary first and then copying that into a dictionary of dataframes? The files I'm working with will be huge.

Michael

2 Answers

You can use pd.json_normalize(), building one dataframe per top-level key:

import json

import pandas as pd

json_file = "exampleJson.json"
with open(json_file) as j:
    data = json.load(j)

# One normalized dataframe per top-level key
df_dict = {d: pd.json_normalize(data=data, record_path=d) for d in data}
print(df_dict)
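
For the sample file above, each value in df_dict is an ordinary dataframe; df_dict["People"], for instance, should print roughly:

  FirstName Surname
0       Max   Smith
1      Jane   Smart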
Jason Baker
  • Sorry, I should have made it clearer. My expected output is a dictionary of dataframes, exactly as I achieved with my code. I just wanted a more efficient way to do it. For example, pd.read_excel (with sheet_name=None) converts all the worksheets directly into a dictionary of dataframes (see the sketch after this thread). Is there any way of doing that with json? – Michael Jun 04 '23 at 15:55
  • See the edit above; it is now a dict of dfs. – Jason Baker Jun 04 '23 at 16:07
  • Thanks Jason. The only difference is the way you suggest copying the data from the json object to the dictionary of dataframes. I compared them using %%timeit (with a bigger json file). My original approach gets "217 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)". Your proposal gets "1.38 s ± 37.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)". So it's about 6 times slower. – Michael Jun 05 '23 at 16:37
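
For reference, the pd.read_excel behaviour mentioned in the comments: passing sheet_name=None makes pandas read every worksheet and return a dict of dataframes keyed by sheet name, which is the shape the question is after for JSON. A minimal sketch (the file name workbook.xlsx is just a placeholder):

import pandas as pd

# sheet_name=None reads all worksheets and returns {sheet_name: DataFrame}
sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)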

You should try this code:

import pandas as pd
import json

# Read the json file
jsonFile = 'exampleJson.json'
with open(jsonFile) as j:
    data = json.load(j)

# Normalize each top-level list into its own dataframe
df1 = pd.json_normalize(data['People'])
print(df1)
df2 = pd.json_normalize(data['Animals'])
print(df2)
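
For the sample JSON above, this should print something like:

  FirstName Surname
0       Max   Smith
1      Jane   Smart
  Breed        Name
0   Cat  WhiteSocks
1   Dog        Zeus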
Nimra Tahir
  • Thanks Nimra, but can you explain how this makes it more efficient? Also, I should have mentioned that I want the result in a dictionary of dataframes because there could be many, and I don't know their names. I was hoping for a json equivalent of pd.read_excel(), which can read many Excel sheets directly into a dictionary of dataframes. – Michael Jun 04 '23 at 14:11
  • Actually, the names should be given explicitly, or you can use a column index; just declaring them from context would not be appropriate, because the data could be misunderstood. – Nimra Tahir Jun 04 '23 at 17:56
  • The json files have hundreds of nested dictionaries, which result in hundreds of dataframes. It is not practical to list them all explicitly as you suggest, nor is it sensible to use column indexes. It is much better to use a dictionary, with keys that have intelligible names which are generated from the data itself. – Michael Jun 05 '23 at 16:44
  • Ahh https://stackoverflow.com/questions/21104592/json-to-pandas-dataframe this might be helpful then. – Nimra Tahir Jun 06 '23 at 06:31
  • Thanks Nimra, but that question deals only with a single dataframe as the desired output, which is not what I want. I need a dictionary of dataframes. At this stage my conclusion is that my original approach is still the best. – Michael Jun 08 '23 at 05:25