
I'm very new to Python (&lt; 2 weeks), and was asked to read 200k+ JSON files I was provided (as-is) into a single database (using Python). These JSON files have flat, one-level attributes which vary across files from 50 to 1000+, but those 50 are a subset of the 1000.

Here is a snippet of a JSON file:

{
  "study_type" : "Observational",
  "intervention.intervention_type" : "Device",
  "primary_outcome.time_frame" : "24 months",
  "primary_completion_date.type" : "Actual",
  "design_info.primary_purpose" : "Diagnostic",
  "design_info.secondary_purpose" : "Intervention",
  "start_date" : "January 2014",
  "end_date" : "March 2014",
  "overall_status" : "Completed",
  "location_countries.country" : "United States",
  "location.facility.name" : "Generic Institution"
}

Our goal is to combine these JSON files into a master database, clean up the individual columns, run descriptive statistics on those columns, and create a final, cleaned-up database.

I'm coming from a SAS background, so my thought was to use pandas and create a (very) large dataframe. I've been combing through Stack Overflow this past week and have leveraged some learnings, but I feel there has to be a way to make this much more efficient.

Below is the code I have written so far - it runs, but very slowly (I estimate it will take days, if not weeks, to run even after eliminating unneeded input attributes/columns starting with "result").

In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.

import json, os
import pandas as pd    
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

#Gets list of files in directory with *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#Initialize list of series
df_list = []

#For every json file found
for js in list_files:

    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())                         #Loads Json file into dictionary
        data_file.close()                                           #Close data file / remove from memory

        data_copy = deepcopy(data)                                  #Copies dictionary file
        for k in data_copy.keys():                                  #Iterate over copied dictionary file
            if k.startswith('result'):                              #If field starts with "X" then delete from dictionary
                del data[k]
        df = pd.Series(data)                                        #Convert Dictionary to Series
        df_list.append(df)                                          #Append to empty series  
        database = pd.concat(df_list, axis=1).reset_index()         #Concatenate series into database

output_db = database.transpose()                                    #Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode = 'w', index=False)

Any thoughts, advice is greatly appreciated. I am completely open to using a different technique or approach entirely (in python) if it's more efficient and still allows us to meet our objectives above.

Thanks!

RDara
  • Note that you can use `json.load()` ([docs](https://docs.python.org/2/library/json.html#json.load)) to read the files directly. No need to add the `read()` and do `json.loads()`. – patrick Apr 27 '17 at 14:42
  • Also, why don't you read all the jsons into one big dictionary, then convert that entire dictionary into a pandas `DataFrame` (see [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html)) that you can write to file. – patrick Apr 27 '17 at 14:44
  • Thanks Patrick! Appreciate the tip, I'll make the change. It doesn't seem to impact run time too much but every efficiency helps. – RDara Apr 27 '17 at 14:49
  • @patrick just saw your second post as I was responding. Let me give that a shot. – RDara Apr 27 '17 at 14:51

2 Answers


I've tried to replicate your approach in a more concise manner, with fewer copies and no repeated appending. It works with the example data you supplied, but I don't know if there are further intricacies in your data set. You can give this a try; I hope the comments help.

import json
import os
import pandas


path_to_json = "XXX"

list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

#set up an empty dictionary
resultdict = {}

for fili in list_files:
    #the with avoids the extra step of closing the file
    with open(os.path.join(path_to_json, fili), "r") as inputjson:
        #the dictionary key is set to filename here, but you could also use e.g. a counter
        resultdict[fili] = json.load(inputjson)
        """
        you can exclude stuff here or later via dictionary comprehensions: 
        http://stackoverflow.com/questions/1747817/create-a-dictionary-with-list-comprehension-in-python
        e.g. as in your example code
        resultdict[fili] = {k:v for k,v in json.load(inputjson).items() if not k.startswith("result")}
        """

#put the whole thing into the DataFrame     
dataframe = pandas.DataFrame(resultdict)

#write out, transpose for desired format
with open("output.csv", "w") as csvout:
    dataframe.T.to_csv(csvout)
patrick
  • Patrick, this works great! It gets rid of the column index values beautifully. Where I run into an issue (JSONDecodeError) is if I un-comment the exclusion criteria. It looks like the resulting `resultdict[fili]` has a key of the json file name & values containing the key/value pair prior to conversion to a dataframe. Thanks again for any thoughts/suggestions. – RDara Apr 27 '17 at 18:49
  • @RDara when doing that, you'll need to comment out the first step without the exclusion. Did that do it? – patrick Apr 27 '17 at 19:57
  • Sorry, which first step are you referring to? Basically I'm adding in the following line: `resultdict[file] = {k: v for k,v in json.load(inputjson).items() if not k.startswith("location")}` (changed to "location" since that is in the sample JSON file above). Thanks for your patience! – RDara Apr 27 '17 at 20:15
  • @RDara I mean this one has to be removed: `resultdict[fili] = json.load(inputjson)` EDIT: btw the iterator is `fili` not `file`; not saying it's a good name but it needs to be consistent and I try to avoid Python-internal names such as _file_. – patrick Apr 27 '17 at 20:32
  • Just to add, what originally took me over 90 minutes was reduced to seconds on a subset of 10k files. Just tested on the full 200k+ of JSON files (pulling a subset of fields) and that seems to have worked fairly quickly as well. Can't thank you enough!! – RDara Apr 27 '17 at 20:54
  • You can just pass the filename to `to_csv()` - no need for `open()` there. – John Zwinck Apr 28 '17 at 04:37
  • @JohnZwinck Good call, I always get that confused with `read_csv()` where that is not possible afaik. – patrick Apr 28 '17 at 13:08
  • @patrick It's possible with `read_csv()` as well. Most people just pass a filename. – John Zwinck Apr 28 '17 at 15:12

Your most critical performance bug is probably this:

database = pd.concat(df_list, axis=1).reset_index()

You do this in a loop, every time adding one more item to df_list and then calling concat again. But the "database" variable is not used until the very end, so you can do this step just once, outside the loop.

With Pandas, "concat" in a loop is a huge anti-pattern. Build your list in the loop, concat once.
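To illustrate the pattern with some stand-in series (not your real data):

```python
import pandas as pd

# Stand-in for the per-file loop: build the list first...
df_list = []
for i in range(3):
    df_list.append(pd.Series({"a": i, "b": i * 2}, name="file_%d" % i))

# ...then concatenate exactly once, after the loop.
database = pd.concat(df_list, axis=1)
print(database.shape)  # (2, 3)
```

Concatenating inside the loop rebuilds the whole frame on every iteration, so the total work grows quadratically with the number of files; a single concat at the end does the work once.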

The second thing is that you should also use Pandas to read the JSON files: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

Keep it simple. Write a function that takes a path, calls pd.read_json(), removes rows you don't need (series.str.startswith()), etc.
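A sketch of what such a function could look like (`load_study` is a hypothetical name; this assumes each file is a flat key/value object that `read_json()` can load with `typ='series'`, and it filters on the index labels rather than the values):

```python
import pandas as pd

def load_study(path):
    """Load one flat JSON file as a Series of attribute -> value."""
    s = pd.read_json(path, typ='series')
    # startswith on the index gives a boolean mask; ~ inverts it,
    # keeping only attributes that do NOT start with "result".
    return s[~s.index.str.startswith('result')]
```

Note the filter goes through `s.index.str`, since the attribute names live in the Series index, not in its values.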

Once you have that working well, your next step will be to check whether you are CPU limited (CPU usage 100%), or I/O limited (CPU usage much less than 100%).

John Zwinck
  • Thanks John, that one-line movement alone changed the runtime significantly. For the life of me I can't get the startswith() function to work with this series (the only way I can use "read_json()" on these files without error is with typ='series' and orient='records'): `if not data.str.startswith('results'):` returns a ValueError. – RDara Apr 27 '17 at 18:42