I'm very new to Python (< 2 weeks) and was asked to read 200k+ JSON files I was provided (as-is) into a single database, using Python. These JSON files have flat, one-level attributes, which vary per file from 50 to over 1,000, but those 50 are a subset of the 1,000.
Here is a snippet of a json file:
{
    "study_type" : "Observational",
    "intervention.intervention_type" : "Device",
    "primary_outcome.time_frame" : "24 months",
    "primary_completion_date.type" : "Actual",
    "design_info.primary_purpose" : "Diagnostic",
    "design_info.secondary_purpose" : "Intervention",
    "start_date" : "January 2014",
    "end_date" : "March 2014",
    "overall_status" : "Completed",
    "location_countries.country" : "United States",
    "location.facility.name" : "Generic Institution"
}
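For context, each file is just flat key/value pairs of strings, so a single file reads straight into a dict (the file name below is only a placeholder):

import json

with open('/home/ubuntu/json_flat/some_file.json') as data_file:  # placeholder file name
    data = json.load(data_file)  # flat dict, e.g. {"study_type": "Observational", ...}

print(data['study_type'])  # "Observational"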
Our goal is to build a master database from these JSON files, clean up the individual columns, run descriptive statistics on those columns, and create a final, cleaned-up database.
I'm coming from a SAS background, so my thought was to use pandas and create a (very) large dataframe. I've been combing through Stack Overflow this past week and have leveraged some learnings, but I feel there has to be a way to make this much more efficient.
Below is the code I have written so far. It runs, but very slowly: I estimate it will take days, if not weeks, even after eliminating the unneeded input attributes/columns starting with "result".
In addition, the awkward way I convert the dictionary to a final table leaves the column index numbers above the column names, which I haven't been able to figure out how to remove.
import json, os
import pandas as pd
from copy import deepcopy

path_to_json = '/home/ubuntu/json_flat/'

# Get list of files in the directory with a *.json suffix
list_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# Initialize list that will hold one Series per file
df_list = []

# For every json file found
for js in list_files:
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.loads(data_file.read())  # Load JSON file into a dictionary
        data_file.close()                    # Close data file (the with-block does this anyway)
    data_copy = deepcopy(data)               # Copy the dictionary so keys can be deleted while iterating
    for k in data_copy.keys():               # Iterate over the copied dictionary
        if k.startswith('result'):           # If the field starts with "result", delete it from the original
            del data[k]
    df = pd.Series(data)                     # Convert dictionary to Series
    df_list.append(df)                       # Append to the list of Series

database = pd.concat(df_list, axis=1).reset_index()  # Concatenate Series into one DataFrame
output_db = database.transpose()                     # Transpose rows/columns
output_db.to_csv('/home/ubuntu/output/output_db.csv', mode='w', index=False)
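One direction I've been considering (treat it as an untested sketch rather than something I've benchmarked, since I don't know whether it's the idiomatic pandas way) is to keep each file as a plain dict, drop the unwanted keys with a dict comprehension, and build the DataFrame once at the end, so there is no per-file Series, no concat, and no transpose:

import json, os
import pandas as pd

path_to_json = '/home/ubuntu/json_flat/'
records = []

for js in os.listdir(path_to_json):
    if not js.endswith('.json'):
        continue
    with open(os.path.join(path_to_json, js)) as data_file:
        data = json.load(data_file)
    # Keep only the attributes that don't start with "result"
    records.append({k: v for k, v in data.items() if not k.startswith('result')})

# pandas aligns the differing key sets into one set of columns, filling gaps with NaN
database = pd.DataFrame(records)
database.to_csv('/home/ubuntu/output/output_db.csv', index=False)

The idea is that appending plain dicts to a list and letting pd.DataFrame do the column alignment once should avoid the repeated Series construction and the transpose/index problem, but I haven't verified how it scales to 200k files.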
Any thoughts or advice would be greatly appreciated. I am completely open to using a different technique or approach entirely (in Python) if it's more efficient and still allows us to meet the objectives above.
Thanks!