I have a number of JSON files I need to analyze. I am using IPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.
My main bottleneck is reading in the files. Some files are small and read quickly, but the larger files are slowing me down.
Here is some example code (sorry, I cannot provide the actual data files):
import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)
event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)
The support tickets are very small; the largest I've seen is 6KB. So this code runs pretty fast:
In [3]: len(support_files)
Out[3]: 5278
In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop
But the larger files are definitely slowing me down; these event files can reach ~2.5MB each:
In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397
In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop
I've researched how to speed up the process and came across this post. However, when using UltraJSON (ujson) the timing was slightly worse:
In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop
simplejson did not do much better:
In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop
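For context, in both cases the change was roughly just a drop-in swap of the parser inside read_json_files, along these lines (sketch; both libraries expose a json-compatible load):

import ujson  # or: import simplejson

def read_json_files(path_to_file):
    # same function as above, with the stdlib parser swapped out
    with open(path_to_file) as p:
        return ujson.load(p)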
Any tips on how to optimize this code and read a large number of JSON files into Python more efficiently would be much appreciated.
Finally, this post is the closest I've found to my question, but it deals with one giant JSON file rather than many smaller ones.
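Would parallelizing the reads be worth trying? Here is a rough sketch of what I have in mind, assuming the files are independent (giant_list_parallel and max_workers=4 are just illustrative names and values, and I'm not sure whether worker processes are actually needed or threads would suffice, since the JSON parsing itself is CPU-bound):

import concurrent.futures

def giant_list_parallel(json_files, max_workers=4):
    # hypothetical variant of giant_list: parse files in worker
    # processes instead of one at a time in the main process
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_json_files, json_files))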