
I have a JSON file that I want to convert into a DataFrame object in Python. I found a way to do the conversion but unfortunately it takes ages, and thus I'm asking if there are more efficient and elegant ways to do the conversion.

I use the json library to load the JSON file as a dictionary, which works fine:

import json

with open('path/file.json') as d:
file = json.load(d)

Here's some mock data that mimics the structure of the real data set:

dict1 = {'first_level':[{'A': 'abc',
                     'B': 123,
                     'C': [{'D' :[{'E': 'zyx'}]}]},
                    {'A': 'bcd',
                     'B': 234,
                     'C': [{'D' :[{'E': 'yxw'}]}]},
                    {'A': 'cde',
                     'B': 345},
                    {'A': 'def',
                     'B': 456,
                     'C': [{'D' :[{'E': 'xwv'}]}]}]}

Then I create an empty DataFrame and append the data that I'm interested in to it with a for loop:

import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'])

for i in range(len(dict1['first_level'])):
    try:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B'],
                'C': dict1['first_level'][i]['C'][0]['D'][0]['E']}
        df = df.append(data, ignore_index=True)
    except KeyError:
        data = {'A': dict1['first_level'][i]['A'],
                'B': dict1['first_level'][i]['B']}
        df = df.append(data, ignore_index=True)

Is there a way to get the data straight from the JSON more efficiently or can I write the for loop more elegantly?

(Running through the dataset (~150k elements) takes over an hour. I'm using Python 3.6.3, 64-bit.)

1 Answer


You could use pandas.read_json (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).

Alternatively, you could use Spark/PySpark to convert the JSON to a DataFrame fairly easily and manage your data that way, but that might be more than you need.
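If read_json doesn't fit the nesting, the usual pandas fix is to collect plain dicts and build the DataFrame once at the end: repeated df.append copies the entire frame on every iteration, which is likely where the hour goes. A minimal sketch using the question's mock data, with dict.get standing in for the try/except:

```python
import pandas as pd

dict1 = {'first_level': [{'A': 'abc', 'B': 123,
                          'C': [{'D': [{'E': 'zyx'}]}]},
                         {'A': 'bcd', 'B': 234,
                          'C': [{'D': [{'E': 'yxw'}]}]},
                         {'A': 'cde', 'B': 345},
                         {'A': 'def', 'B': 456,
                          'C': [{'D': [{'E': 'xwv'}]}]}]}

# Build a list of plain dicts, then create the DataFrame in a single call.
rows = []
for item in dict1['first_level']:
    c = item.get('C')  # None for records without the nested 'C' key
    rows.append({'A': item['A'],
                 'B': item['B'],
                 'C': c[0]['D'][0]['E'] if c else None})

df = pd.DataFrame(rows, columns=['A', 'B', 'C'])
```

At ~150k records this should finish quickly, since each iteration is a cheap list append rather than a full DataFrame copy.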

Dylan Moore
  • read_json() gives me a DataFrame with dictionaries that I have to parse through. Is it more efficient to transform the data this way? – mad_datter Mar 12 '18 at 19:27
  • It depends on what your use-case is. You can explode the dict using PySpark and get columns for each relationship in the data structure. Alternatively you can assign a schema beforehand and expect that transformation in your new object. – Dylan Moore Mar 12 '18 at 20:12
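Regarding the follow-up comment: pd.json_normalize (pandas.io.json.json_normalize on pandas < 1.0) flattens the outer records in one call, so only the nested 'C' list still needs unpacking. A sketch on the question's mock data, where the lambda mirrors the ['C'][0]['D'][0]['E'] lookup from the question:

```python
import pandas as pd

dict1 = {'first_level': [{'A': 'abc', 'B': 123,
                          'C': [{'D': [{'E': 'zyx'}]}]},
                         {'A': 'bcd', 'B': 234,
                          'C': [{'D': [{'E': 'yxw'}]}]},
                         {'A': 'cde', 'B': 345},
                         {'A': 'def', 'B': 456,
                          'C': [{'D': [{'E': 'xwv'}]}]}]}

# Flatten the outer list of records; 'C' stays a list (NaN where absent).
df = pd.json_normalize(dict1['first_level'])

# Pull 'E' out of the nested structure; leave None where 'C' was missing.
df['C'] = df['C'].apply(
    lambda v: v[0]['D'][0]['E'] if isinstance(v, list) else None)
```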