
I have a 90G file made of JSON items, one per line. Below is a sample of 3 lines only:

{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id2","payload":{"cleared":"2020-01-31T11:01:54Z","first":"2020-01-31T02:45:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id3","payload":{"cleared":"2020-01-31T5:33:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T07:50:47Z","last":"2019-01-31T04:50:47Z"}}

The end goal is, for each line, to take the max of first, cleared and last and update timestamp with that max, then sort all the items by timestamp. Ignore the sorting for now.

I initially converted the whole file into a single JSON array (input.json) and used the code below:

#!/usr/bin/python
import json as simplejson
from collections import OrderedDict

# Load the entire JSON document into memory, preserving key order
with open("input.json", "r") as jsonFile:
    data = simplejson.load(jsonFile, object_pairs_hook=OrderedDict)

# Replace each item's timestamp with the latest of first, cleared and last
for x in data:
    maximum = max(x['payload']['first'], x['payload']['cleared'], x['payload']['last'])
    x['payload']['timestamp'] = maximum

# Sort all items by the updated timestamp
data_sorted = sorted(data, key=lambda x: x['payload']['timestamp'])

with open("output.json", "w") as write_file:
    simplejson.dump(data_sorted, write_file)

The above code worked for a small test file, but the script got killed when I ran it on the 90G file.

I then decided to deal with it line by line, using the code below:

#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict

first_arg = sys.argv[1]
data = []

with open(first_arg, "r") as jsonFile:
    for line in jsonFile:
        y = simplejson.loads(line, object_pairs_hook=OrderedDict)

        payload = y['payload']
        first = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')

        lst = [first, clearedAt, last]

        # Latest of the three timestamps, ignoring any missing fields
        maximum = max(x for x in lst if x is not None)
        y['payload']['timestamp'] = maximum
        data.append(y)

with open("jl2json_new.json", "w") as write_file:
    simplejson.dump(data, write_file, indent=4)

It still got killed. So what is the best way to approach this problem?

I tried the following approach but it wasn't helpful: https://stackoverflow.com/a/21709058/322541

subzero
  • For your second approach, try appending your data line by line to a new file. – Frank Apr 16 '20 at 20:23
  • I think in both cases you try to read the 90G file into memory. In the first you read the whole file at once. In the second you read it line by line, but you add every line to an array that ends up holding all 90G worth before you close the file and write to a new one. Perhaps create a new file and append after each line read. – Sri Apr 16 '20 at 20:23
  • @Frank That's what I initially did but the process got killed! – subzero Apr 16 '20 at 20:28
  • @Sri data is empty at the beginning so I'm not appending to an array containing 90G lines worth. Am I misunderstanding you? – subzero Apr 16 '20 at 20:30
  • See Bruno Bronosky's answer here: https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into Read and write line by line. – Frank Apr 16 '20 at 20:33
  • If you want your output file to still be valid JSON, you can append "[" at the beginning, add your data line by line, and then append "]" at the end. – Frank Apr 16 '20 at 20:34 (see the sketch after these comments)
  • @subzero, my mistake. What I meant was that the array you create, while empty at the beginning, eventually holds the entire contents of the 90G file. You need to do this line by line because the file is far too large. – Sri Apr 16 '20 at 20:36
  • As an aside, why use `simplejson` over the `json` module? – AMC Apr 16 '20 at 21:11
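
A minimal sketch of Frank's suggestion above - read and rewrite line by line, wrapping the output in "[" and "]" so the result is still one valid JSON document. The file names and the comma handling are illustrative assumptions, not code from the thread:

import json

with open("input.jsonl") as src, open("output.json", "w") as dst:
    dst.write("[\n")
    first_item = True
    for line in src:
        item = json.loads(line)
        payload = item['payload']
        # latest of the three timestamps, skipping any missing field
        item['payload']['timestamp'] = max(
            v for v in (payload.get('first'), payload.get('cleared'), payload.get('last'))
            if v is not None
        )
        if not first_item:
            dst.write(",\n")  # comma between array elements
        dst.write(json.dumps(item))
        first_item = False
    dst.write("\n]\n")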

2 Answers


You go all the way through your processing for each line - you parse the line into the y variable and process it, but then, instead of writing it to the output file, you store it in the data list. So of course you end up with all the data in memory (which, deserialized from JSON strings into Python objects, would take several hundred gigabytes).
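
As a rough illustration of that overhead - a sketch only, and the exact numbers depend on the Python build - one ~165-byte line from the sample above expands to roughly a kilobyte of Python objects once parsed:

import json
import sys

line = '{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}'
obj = json.loads(line)

def deep_size(o):
    # very rough recursive sizeof for dicts, lists and strings
    size = sys.getsizeof(o)
    if isinstance(o, dict):
        size += sum(deep_size(k) + deep_size(v) for k, v in o.items())
    elif isinstance(o, (list, tuple)):
        size += sum(deep_size(i) for i in o)
    return size

print(len(line), deep_size(obj))  # JSON bytes vs. approximate in-memory bytes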

If your code already works for small samples, just change it to write each line out as it goes:

#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict

first_arg = sys.argv[1]


with open(first_arg, "rt") as jsonFile, open("jl2json_new.json", "wt") as write_file:
    for line in jsonFile:
        y = simplejson.loads(line, object_pairs_hook=OrderedDict)

        payload = y['payload']
        first = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')

        lst = [first, clearedAt, last]

        maximum = max(x for x in lst if x is not None)
        y['payload']['timestamp'] = maximum
        # Write the updated record out immediately instead of keeping it in memory
        write_file.write(simplejson.dumps(y) + "\n")
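
The question sets sorting aside for now, but as a follow-up sketch (not part of the answer above): once the rewritten file has one JSON object per line, it can be sorted by timestamp without holding everything in memory, by sorting chunks and merging them with heapq.merge. The file names and chunk size below are assumptions for illustration:

import heapq
import json
import tempfile

def sort_key(line):
    # ISO 8601 timestamps compare correctly as plain strings
    return json.loads(line)['payload']['timestamp']

def flush(chunk, chunk_files):
    # sort one in-memory chunk and spill it to a temporary file
    chunk.sort(key=sort_key)
    tmp = tempfile.TemporaryFile("w+")
    tmp.writelines(chunk)
    tmp.seek(0)
    chunk_files.append(tmp)

chunk, chunk_files = [], []
with open("jl2json_new.json") as src:
    for line in src:
        chunk.append(line)
        if len(chunk) >= 1000000:  # tune to the memory available
            flush(chunk, chunk_files)
            chunk = []
if chunk:
    flush(chunk, chunk_files)

with open("jl2json_sorted.json", "w") as dst:
    dst.writelines(heapq.merge(*chunk_files, key=sort_key))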
jsbueno

The mmap module lets you map the file into virtual memory, so pages are only read from disk as they are touched. This keeps you from reading the whole thing in up front.

import mmap
import json
from collections import OrderedDict

with open("test.json", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)

    # read the content through the memory map rather than the file object
    json_dict = json.loads(mm.read(), object_pairs_hook=OrderedDict)

    print(json_dict)
    # close the map
    mm.close()

This Stack Overflow question, about reading JSON data in chunks at a time, may be another alternative to try out.
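
For reference, a minimal sketch of that chunk-at-a-time idea (the function name, chunk size and buffering are assumptions for illustration, not code from the linked post): read fixed-size chunks and pull complete objects out of the buffer with json.JSONDecoder.raw_decode.

import json

def iter_json_objects(path, chunk_size=1 << 20):
    # Yield JSON objects from a file of newline-separated (or concatenated)
    # objects, reading fixed-size chunks instead of the whole file at once.
    decoder = json.JSONDecoder()
    buf = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip()
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    # incomplete object at the end of the buffer: read more first
                    break
                yield obj
                buf = buf[end:]

for item in iter_json_objects("input.json"):
    print(item['payload']['timestamp'])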

gnodab