I have a 90G file made of JSON items, one per line. Below is a sample of 3 lines only:
{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id2","payload":{"cleared":"2020-01-31T11:01:54Z","first":"2020-01-31T02:45:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id3","payload":{"cleared":"2020-01-31T5:33:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T07:50:47Z","last":"2019-01-31T04:50:47Z"}}
The end goal is, for each line, to get the max of first, cleared and last, and update timestamp with that max. Then sort all the items by timestamp. Ignore the sorting for now.
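To make the per-line step concrete, here is a minimal sketch of what I mean. The helper name `update_timestamp`, the `FIELDS` tuple and the `TS_FORMAT` string are my own illustrative choices, and the format is only assumed from the sample lines. It parses the timestamps instead of comparing raw strings, because the third sample line has a non-zero-padded hour ("T5:33:54Z"), which would not always compare correctly as text.

```python
import json
from datetime import datetime

FIELDS = ("first", "cleared", "last")  # fields named in the question
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"       # assumption: inferred from the sample lines


def update_timestamp(line):
    """Return the JSON line with payload.timestamp set to the latest of FIELDS."""
    item = json.loads(line)
    payload = item["payload"]
    # Keep only the fields that are actually present and not null.
    present = [payload[f] for f in FIELDS if payload.get(f) is not None]
    if present:
        # Compare as datetimes, but store the original string unchanged.
        payload["timestamp"] = max(
            present, key=lambda s: datetime.strptime(s, TS_FORMAT)
        )
    return json.dumps(item, separators=(",", ":"))
```

If a line has none of the three fields, this sketch leaves its timestamp untouched.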
I initially converted the file into a single JSON file (one array of items) and used the code below:
#!/usr/bin/python
import json as simplejson
from collections import OrderedDict

with open("input.json", "r") as jsonFile:
    data = simplejson.load(jsonFile, object_pairs_hook=OrderedDict)

for x in data:
    maximum = max(x['payload']['first'], x['payload']['cleared'], x['payload']['last'])
    x['payload']['timestamp'] = maximum

data_sorted = sorted(data, key=lambda x: x['payload']['timestamp'])

with open("output.json", "w") as write_file:
    simplejson.dump(data_sorted, write_file)
The above code worked for a small test file, but the script got killed when I ran it on the 90G file.
I then decided to process the original file line by line using the code below:
#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict

first_arg = sys.argv[1]
data = []

with open(first_arg, "r") as jsonFile:
    for line in jsonFile:
        y = simplejson.loads(line, object_pairs_hook=OrderedDict)
        payload = y['payload']
        first = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')
        lst = [first, clearedAt, last]
        maximum = max((x for x in lst if x is not None))
        y['payload']['timestamp'] = maximum
        data.append(y)

with open("jl2json_new.json", "w") as write_file:
    simplejson.dump(data, write_file, indent=4)
It still got killed. What is the best way to approach this problem?
I tried the following approach but it wasn't helpful: https://stackoverflow.com/a/21709058/322541
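One direction I am considering, shown here only as a rough sketch: my second script still appends every item to the data list, so the whole file ends up in memory anyway. Writing each updated item straight to the output file would avoid that. This assumes the output can stay as one JSON object per line rather than a single JSON array. The function name `stream_update` and the `TS_FORMAT` string are hypothetical, with the format assumed from the sample lines.

```python
import json
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumption: inferred from the sample lines


def stream_update(src_path, dst_path):
    """Rewrite src_path to dst_path one line at a time; return the item count."""
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            item = json.loads(line)
            payload = item["payload"]
            candidates = [payload.get(k) for k in ("first", "cleared", "last")]
            candidates = [c for c in candidates if c is not None]
            if candidates:
                # Compare as datetimes, keep the original string as the value.
                payload["timestamp"] = max(
                    candidates, key=lambda s: datetime.strptime(s, TS_FORMAT)
                )
            # Write immediately instead of appending to a list.
            dst.write(json.dumps(item) + "\n")
            count += 1
    return count
```

It could be called as `stream_update(sys.argv[1], "output.jl")`, where the output file name is just a placeholder. Only one item is held in memory at a time, so memory use should not grow with the size of the input.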