
We are currently running a program that has already taken an hour to execute (and still hasn't finished), so we are wondering if we can improve our code so it runs faster.

Our program consists of two parts: first we need to rewrite the keys in the file, because the JSON data uses the same key ("track") for every item; if we don't do this, the output contains only the first track. Second, we need to write the JSON data to a CSV file.

Snippet of the JSON file (the actual file is around 900 MB):

{
 "VV":{
    "Version":1,
    "Data":[
       {
          "track":[
             {
                "time":"YYYY-MM-DDTHH:MM:SS:MS",
                "tl":{
                   "x":10,
                   "y":11
                },
                "br":{
                   "x":20,
                   "y":20
                }
             },
             {
                "time":"YYYY-MM-DDTHH:MM:SS:MS",
                "tl":{
                   "x":12,
                   "y":15
                },
                "br":{
                   "x":22,
                   "y":23
                }
             }
          ],
          "track":[
             {
                "time":"YYYY-MM-DDTHH:MM:SS:MS",
                "tl":{
                   "x":30,
                   "y":39
                },
                "br":{
                   "x":40,
                   "y":45
                }
             },
             {
                "time":"YYYY-MM-DDTHH:MM:SS:MS",
                "tl":{
                   "x":12,
                   "y":18
                },
                "br":{
                   "x":22,
                   "y":24
                }
             }
          ]
         }
      ]
   }
}

First part of our code:

import json

with open(r'filename.json') as json_file:
    fil = json_file.read()
    i = 0
    print i
    # rename every duplicate "track" key to a unique name: tr0, tr1, tr2, ...
    while 'track' in fil:
        fil = fil.replace('track', 'tr' + str(i), 1)
        i = i + 1
    print i
    input_data = json.loads(fil)

data_d = input_data['VV']['Data'][0]

Second part:

import csv

with open(r'output.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    i = 0
    for track, data in data_d.items():
        i = i + 1  # TRACK number

        for item in data:
            item_values = []

            #TRACK
            item_values.append(i)

            #DAY
            day = item['time'][8:10]
            item_values.append(day)

            #COORDINATEN
            item_values.append(item['tl']['x'])
            item_values.append(item['tl']['y'])
            item_values.append(item['br']['x'])
            item_values.append(item['br']['y'])

            #TIME
            time = item['time'][11:13] + item['time'][14:16] + item['time'][17:19] + item['time'][20:23]
            item_values.append(time)

            writer.writerow(item_values)
  • I'm trying to understand your question. First, you need to replace 'track' with 'tr'. Second, you need to dump a CSV file. And you want to optimize the process. Is that it? – Raiyan Oct 06 '14 at 13:56
  • Yes, that is exactly our question. We need to replace track with tr0, tr1, tr2, etc., because we need unique keys. – Marjolein Oct 06 '14 at 14:01
  • Can you change the upstream output format? For example, output one track per line (like a tweet stream) instead of creating one large json object. – jfs Oct 06 '14 at 14:09
  • I see that your JSON object contains multiple fields named 'track'; can you produce a JSON object that contains an array named 'track' instead? – Raiyan Oct 06 '14 at 14:09
  • We cannot change the JSON object, as we do not create it but only receive it from an external party. The smallest JSON file contains 3 million lines, just to give an estimate of the size of the file. – Marjolein Oct 06 '14 at 14:12
  • You should use for i, (track, data) in enumerate(data_d.items()) instead of incrementing i; it's not a performance issue, it just makes the code more Pythonic. – Uxío Oct 06 '14 at 14:13
  • @Uxío: if it is Python 2 then `data_d.iteritems()` could be used to avoid creating the list. – jfs Oct 06 '14 at 14:17

3 Answers


The first thing is to measure the time performance of your current code: extract a smallish representative data sample from your input so that you can run the benchmark in a couple of seconds (a minute, tops). Save the output that your code produces, to test that you haven't broken anything later.
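
For example, a minimal timing sketch (it assumes the whole conversion is wrapped in a hypothetical convert() function; adapt the name to your script):

import time

start = time.time()
convert()  # hypothetical function running your whole conversion on the small sample
print 'took %.2f seconds' % (time.time() - start)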

first we need to rewrite the keys in the file, because the JSON data uses the same key ("track") for every item; if we don't do this, the output contains only the first track.

The second thing is to avoid changing the string (remove the first part of your code). In the worst case, your 900 MB file is not actually a json file, because the json format doesn't support duplicate names within a json object: "When the names within an object are not unique, the behavior of software that receives such an object is unpredictable." In that case you could use something like the multidict() solution adapted for large files: if you use Python 2, avoid the .items() call that creates a list unnecessarily and use .iteritems() instead, and avoid the dict(d) copy by returning the defaultdict directly:

import json
from collections import defaultdict

def multidict(ordered_pairs):
    """Convert duplicate key values to a list."""
    # read all values into list
    d = defaultdict(list)
    for k, v in ordered_pairs:
        d[k].append(v)

    # collapse list that has only 1 item
    for k, v in d.iteritems():
        if len(v) == 1:
            d[k] = v[0]
    return d 

with open('filename.json') as json_file:
    obj = json.load(json_file, object_pairs_hook=multidict)
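
With this hook the duplicated "track" keys are collected into a single list of track lists, so for the snippet in the question the data could be reached roughly like this (a sketch, assuming the structure shown above):

data_d = obj['VV']['Data'][0]
# data_d['track'] is a list with one entry per duplicated "track" key,
# and each entry is itself the list of track items
for i, tracks in enumerate(data_d['track'], start=1):
    for item in tracks:
        print i, item['time']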

After each change, measure the time performance again and check that the output is still correct.
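
A quick way to check that the output hasn't changed, assuming you keep the CSV produced by the previous run around (the file names here are only illustrative):

import filecmp

# compare the freshly produced CSV against the saved reference output
print filecmp.cmp('output_reference.csv', 'output.csv', shallow=False)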

For readability, you could rewrite the second part:

import csv

with open('output.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for i, data in enumerate(data_d.itervalues(), start=1):
        for item in data:
            t = item['time']
            writer.writerow([
                #TRACK
                i,
                #DAY
                t[8:10],
                #COORDINATEN
                item['tl']['x'],
                item['tl']['y'],
                item['br']['x'],
                item['br']['y'],
                #TIME
                t[11:13]+t[14:16]+t[17:19]+t[20:23],
            ])

If using multidict() instead of the string replacement hasn't improved the time performance, then you could try to abuse multidict() to change your input format without loading the whole json object into memory:

#!/usr/bin/env python2
import json
import sys
from collections import defaultdict

def write_tracks(ordered_pairs):
    # read all values into list
    d = defaultdict(list)
    for k, v in ordered_pairs:
        d[k].append(v)

    # collapse list that has only 1 item
    for k, v in d.iteritems():
        if k == 'track':
            for tracks in v: # print one track (in json format) per line
                print("\n".join(map(json.dumps, tracks)))
            break
        elif len(v) == 1:
            d[k] = v[0]
    else: # no tracks, return the constructed object
        return d

json.load(sys.stdin, object_pairs_hook=write_tracks) # write tracks

You could use it from the command-line:

$ <filename.json python write_tracks.py | python convert_tracks.py >output.csv

where convert_tracks.py is something like:

#!/usr/bin/env python2
import csv
import json
import sys

def main():
    writer = csv.writer(sys.stdout)
    for i, line in enumerate(sys.stdin, start=1):
        try:
            item = json.loads(line)
        except ValueError:
            pass # ignore errors
        else:
            t = item['time']
            writer.writerow([
                #TRACK
                i,
                #DAY
                t[8:10],
                #COORDINATEN
                item['tl']['x'],
                item['tl']['y'],
                item['br']['x'],
                item['br']['y'],
                #TIME
                t[11:13]+t[14:16]+t[17:19]+t[20:23],
            ])

if __name__ == "__main__":
    main()
jfs
  • @Marjolein: I've added one track per line variant (`write_tracks()`) – jfs Oct 06 '14 at 15:23
  • @Marjolein: I've changed the code to use stdin/stdout to facilitate time measurements/testing. – jfs Oct 06 '14 at 15:34

Two things jump out at me, both in this line:

while ('track' in fil) :

First: this while loop rescans the string from the beginning on every iteration. I would not be surprised if you're getting performance issues from this alone. Every time it completes a replacement, it searches the entire string all over again; that's just really inefficient.

Since you're just working with a text object, and not a parsed JSON object, you might be better off using a regex with a function replacement, or another regex-based strategy that finds all the matches once and then acts on them. Something like this:

import re

i = 0
def sub_track(match):
    global i
    i += 1
    return "tr_%s" % i

RE_track = re.compile('track')
fil = RE_track.sub(sub_track, fil)  # re.sub returns the new string, so keep the result

You could also use a non-Python program like sed and just create a copy of the file with all the occurrences replaced.

Second: I don't think it's smart to search for the bare word "track". You might match actual data. I would at least try to match it as a JSON key and search/replace against a string like "track":[.
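
For example (an illustrative pattern only; adjust it to your data):

import re

i = 0

# match "track" only where it introduces an array value, not inside the data
RE_track = re.compile(r'"track"\s*:\s*\[')

def sub_track(match):
    global i
    i += 1
    return '"tr_%s":[' % i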

That being said, reading and manipulating a 900 MB file is going to take a lot of memory. I would probably either try to do this on the command line with sed, or try to figure out a way to do it in chunks of data, reading and writing a piece at a time. I assume this is all one line, so you can't iterate over the file descriptor with readlines() and would have to work with ranges of bytes. I've never had to deal with analyzing a stream of data before, so I can't offer any insight on that.
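
A rough sketch of that chunked approach, assuming the goal is only to rename the duplicated "track" keys (file names and the chunk size are illustrative; the held-back tail is there so a "track" split across a chunk boundary isn't missed):

import re

PATTERN = '"track"'
RE_TRACK = re.compile(re.escape(PATTERN))
counter = [0]

def number_track(match):
    name = '"tr%d"' % counter[0]   # produces "tr0", "tr1", ...
    counter[0] += 1
    return name

with open('filename.json') as src, open('renamed.json', 'w') as dst:
    tail = ''
    while True:
        chunk = src.read(1 << 20)          # read 1 MB at a time
        buf = tail + chunk
        if not chunk:                      # end of file: process what's left
            dst.write(RE_TRACK.sub(number_track, buf))
            break
        # hold back trailing characters that might be the start of a
        # '"track"' split across the chunk boundary
        keep = 0
        for n in range(len(PATTERN) - 1, 0, -1):
            if buf.endswith(PATTERN[:n]):
                keep = n
                break
        tail = buf[len(buf) - keep:] if keep else ''
        dst.write(RE_TRACK.sub(number_track, buf[:len(buf) - keep]))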

Jonathan Vanasco

Try something like this:

import json
from pprint import pprint

json_data = open('filename.json')
data = json.load(json_data)
pprint(data)
json_data.close()
Hussain Shabbir