
I have a number of JSON files I need to analyze. I am using IPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.

My main bottleneck is reading in the files. The smaller files are read quickly, but the larger files are slowing me down.

Here is some example code (sorry, I cannot provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)

The support tickets are very small in size--the largest I've seen is 6 KB. So this code runs pretty fast:

In [3]: len(support_files)
Out[3]: 5278

In [5]: %timeit giant_list(support_files)
1 loop, best of 3: 557 ms per loop

But the larger files are definitely slowing me down...these event files can reach ~2.5 MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/
Out[7]: 397

In [8]: %timeit giant_list(event_files)
1 loop, best of 3: 14.2 s per loop

I've researched how to speed up the process and came across this post; however, when using UltraJSON the timing was slightly worse:

In [3]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop

simplejson did no better:

In [4]: %timeit giant_list(traffic_files)
1 loop, best of 3: 16.3 s per loop
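
(For reference, swapping parsers is just a drop-in change to the load call; a rough sketch, assuming ujson or simplejson is installed:)

import ujson  # or: import simplejson as ujson

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        # same call signature as json.load
        return ujson.load(p)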

Any tips on how to optimize this code and read a large number of JSON files into Python more efficiently would be much appreciated.

Finally, this post is the closest I've found to my question, but it deals with one giant JSON file, not many smaller ones.

measureallthethings
    Your bottleneck is I/O, not parsing speed. Not much to be done other than get a faster disk (do you run on an SSD yet?). – Martijn Pieters Oct 04 '16 at 16:49
  • And `json` in the Python library is the exact same project as `simplejson`. – Martijn Pieters Oct 04 '16 at 16:51
  • @MartijnPieters How did you reach that conclusion? Based on some quick tests, `json.load()` reaches about 46MiB/s on a fast CPU. That's not out of reach for disk-based storage, never mind SSDs. And that's ignoring the possibility that his input files are cached in memory... – marcelm Oct 04 '16 at 19:05
  • @measureallthethings How long is it currently taking to load those json files? Note that you're trying to read _and parse_ about 1.2GiB of data. Also note that those JSON entities may end up using much more memory as Python objects. The integer `5` takes one byte in JSON, but may cost something like 16 bytes as a Python object. – marcelm Oct 04 '16 at 19:10

1 Answer


Use a list comprehension to avoid resizing the list multiple times:

def giant_list(json_files):
    return [read_json_file(path) for path in json_files]

You are closing the file object twice; do it only once (on exiting the `with` block, the file is closed automatically):

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)

At the end of the day, your problem is I/O bound, but these changes will help a little. Also, I have to ask: do you really need to have all these dictionaries in memory at the same time?
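
If not, a generator keeps only one parsed file in memory at a time, so you can extract what you need and discard the rest as you go. A minimal sketch (the extracted field names below are placeholders, not your actual schema):

def iter_json_files(json_files):
    # Parse one file at a time; the caller can throw away each
    # dictionary as soon as the needed fields are pulled out.
    for path in json_files:
        with open(path) as p:
            yield json.load(p)

# Hypothetical usage: keep only a couple of fields per file.
slimmed = [{'id': d.get('id'), 'status': d.get('status')}
           for d in iter_json_files(support_files)]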

Łukasz Rogalski
  • Good question--the thousands of smaller files I do not need in memory at the same time. In each case, there are ~5 specific fields I am going to extract, then discard the rest of the dictionary. When it comes to larger event files, I have even more problems...it is Google Analytics data and parsing it makes me cry: https://developers.google.com/analytics/devguides/reporting/core/v4/migration#parsing_the_v4_response. Moreover, I parse it then convert to a Pandas DataFrame...probably going to save that for another post :-/ – measureallthethings Oct 04 '16 at 16:57
  • Even simpler: I ditched the `giant_list()` function and just do a list comprehension directly: `[read_json_file(path) for path in event_files]` – measureallthethings Oct 04 '16 at 17:37
  • would the dictionary be a bottleneck here or did you mention it due to its bad form? – Umar.H Feb 02 '21 at 16:32