
I am new to Python, so please excuse me if I am not asking the questions in a Pythonic way.

My requirements are as follows:

  1. I need to write Python code to implement this requirement.

  2. I will be reading 60 JSON files as input. Each file is approximately 150 GB.

  3. The sample structure of all 60 JSON files is shown below. Please note that each file contains only ONE JSON object, and the huge size of each file comes from the number and size of the "array_element" entries contained in that one huge JSON object.

    { "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "array_element":[] }

  4. The transformation logic is simple: merge the array_element arrays from all 60 files and write them into one HUGE JSON file, so the output file will be almost 150 GB x 60 in size.

Questions I am requesting your help with:

  1. For reading: I am planning to use the "ijson" module's ijson.items(file_object, "array_element"). Could you please tell me whether ijson.items will "yield" (that is, NOT load the entire file into memory) one item at a time from the "array_element" array in the JSON file? I don't think json.load is an option here because we cannot hold such a huge dictionary in memory.

  2. For writing: I am planning to read each item using ijson.items, "encode" it with json.dumps, and then write it to the file using file_object.write, NOT json.dump, since I cannot hold such a huge dictionary in memory to use json.dump. Could you please let me know whether the f.flush() applied in the code shown below is needed? My understanding is that the internal buffer gets flushed automatically when it is full, and that its size is constant and won't grow to the point of overloading memory. Please let me know.

  3. Is there a better approach than the ones mentioned above for incrementally reading and writing huge JSON files?

Code snippet showing the reading and writing logic described above:

import ijson
import json

# input_files holds the paths of the 60 input files
first_item = True
with open("output.json", "w") as f:
    f.write('{ "array_element": [\n')   # header of the merged JSON object
    for input_file in input_files:
        with open(input_file, "rb") as infile:
            # the ".item" suffix makes ijson yield the array elements one at a
            # time (a plain "array_element" prefix would yield the whole array)
            for item in ijson.items(infile, "array_element.item"):
                if not first_item:
                    f.write(",\n")
                first_item = False
                f.write(json.dumps(item, indent=2))
                f.flush()
    f.write("\n]\n}")

Hope I have asked my questions clearly. Thanks in advance!!

skp
  • Regarding 1., just try it. Regarding 2., the `flush` is not necessary. The file will be closed at the end of the `with` block and thus all Python buffers are flushed. Regarding 3., you have not specified what "better" means. Answers will be subjective, see [Subjective question on Stack Overflow](https://meta.stackexchange.com/questions/28559/). – H. Rittich Oct 19 '21 at 16:29

1 Answer


The following program assumes that the input files have a format that is predictable enough to skip JSON parsing for the sake of performance.

My assumptions, inferred from your description, are:

  • All files have the same encoding.
  • All files have a single position somewhere at the start where "array_element":[ can be found, after which the "interesting portion" of the file begins.
  • All files have a single position somewhere at the end where ]} marks the end of the "interesting portion".
  • All "interesting portions" can be joined with commas and still be valid JSON.

When all of these points are true, concatenating a predefined header fragment, the respective file ranges, and a footer fragment would produce one large, valid JSON file.

import re
import mmap

# patterns that locate the start and the end of the "interesting portion"
head_pattern = re.compile(br'"array_element"\s*:\s*\[\s*', re.S)
tail_pattern = re.compile(br'\s*\]\s*\}\s*$', re.S)

input_files = ['sample1.json', 'sample2.json']

with open('result.json', "wb") as result:
    head_bytes = 500          # search window at the start of each file
    tail_bytes = 50           # search window at the end of each file
    chunk_bytes = 16 * 1024   # copy block size

    result.write(b'{"JSON": "fragment", "array_element": [\n')

    for i, input_file in enumerate(input_files):
        print(input_file)

        with open(input_file, "r+b") as f:
            mm = mmap.mmap(f.fileno(), 0)

            start = head_pattern.search(mm[:head_bytes])
            end = tail_pattern.search(mm[-tail_bytes:])

            if not (start and end):
                print('unexpected file format')
                break

            # the interesting portion begins right after the head match
            start_pos = start.span()[1]
            # the tail match runs to the end of the file, so subtracting its
            # length from the file size gives the absolute end position
            end_pos = mm.size() - end.span()[1] + end.span()[0]

            # separate the portions of consecutive files with a comma
            if i > 0:
                result.write(b',\n')

            # copy the byte range [start_pos, end_pos) in chunks
            pos = start_pos
            mm.seek(pos)
            while True:
                if pos + chunk_bytes >= end_pos:
                    result.write(mm.read(end_pos - pos))
                    break
                else:
                    result.write(mm.read(chunk_bytes))
                    pos += chunk_bytes

            mm.close()

    result.write(b']\n}')

If the file format is 100% predictable, you can throw out the regular expressions and use mm[:head_bytes].index(b'...') etc. for the start/end position arithmetic.
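
For example, a minimal sketch of that variant, replacing the start_pos/end_pos computation inside the loop (the byte literals are assumptions about your exact file format and must be adjusted to match it):

# hypothetical byte literals - .index()/.rindex() do not tolerate the
# whitespace variations that the regular expressions above allow for
head_marker = b'"array_element": ['
tail_marker = b']}'

head = mm[:head_bytes]
tail = mm[-tail_bytes:]

start_pos = head.index(head_marker) + len(head_marker)
# rindex() picks the last occurrence in the tail window, i.e. the
# brackets that actually close the file
end_pos = mm.size() - tail_bytes + tail.rindex(tail_marker)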

Tomalak
  • Yes I can safely assume that file format is always predictable. Thanks very much for the information. I will work on understanding the code you provided. – skp Oct 20 '21 at 00:39
  • @skp I've tested it on small files, and it worked correctly. The loop that is reading and writing the files in chunks is probably fast enough, especially for a one-off in an I/O-bound situation where you typically can read bytes faster than you can write them anyway. Play around with the chunk size and see if it makes any difference. But *"How to quickly copy defined ranges from large files in Python/with shell tools"* might be an interesting secondary question to ask here. Something like [`dd` might be faster than Python](https://unix.stackexchange.com/a/543615/2579). – Tomalak Oct 20 '21 at 06:05
  • @skp Also, it might be worthwhile [to pass the target file through `gzip`](https://stackoverflow.com/q/39450065/18771) as you're writing it. JSON typically compresses extremely well and `gzip` should not impact reading/writing performance measurably (a sketch is shown after this comment thread). – Tomalak Oct 20 '21 at 06:08
  • @skp Did it work? – Tomalak Oct 27 '21 at 13:31
  • Sorry for the delayed response. One thing I noticed and would like to discuss with you is that the `tail_pattern` is not matching these characters - `]}` - at the EOF. Within the `tail_bytes` of 50 bytes, there are a couple of other instances where the pattern `]}` occurs, because the `array_element` itself contains objects that are of type array (sorry, I forgot to mention this in my first post). So I was wondering if we can specifically match the `]}` pattern only at the very end of the file? Please share your thoughts and recommendations. Thanks very much!! – skp Oct 29 '21 at 17:06
  • @skp That's strange - the `$` in the end pattern actually is there to make sure that it only matches at the very end of the file. And of course you can change the number of `tail_bytes`, that's why this is configurable. – Tomalak Oct 29 '21 at 18:52
  • Ahh, I forgot to escape the } in the second regex. See https://regex101.com/r/NdR0TW/2 - works as intended. – Tomalak Oct 29 '21 at 19:07
  • Something strange. Used regex101.com link you specified and tested it. It works perfectly. But when I run it on my program, it still fails to pick the pattern at the EOF. Not sure if I am missing something here. Here is the json I am testing: `{ "s1":"abc", "s2":"abc", "s3":"abc", "s4":"abc", "s5":"abc", "s6":"abc", "array_element": [ {"n_c1":"b"}, { "n_c2":"b", "n_c3":[ {"p1":[111,222,333]}, {"p2":[444,555,666]} ] } ] }` – skp Oct 31 '21 at 11:53
  • Here are relevant code snippets and output: `tail_pattern = re.compile(b'\s*]\s*\}\s*$', re.S) tail_bytes = 50 mm = mmap.mmap(f.fileno(), 0) start = head_pattern.search(mm[:head_bytes]) end = tail_pattern.search(mm[tail_bytes:]) print("Total size in bytes of the file I am searching the patterns on:", mm.size()) print("head_pattern_span:", start) print("tail_pattern_span:", end) start_pos = start.span()[1] end_pos = mm.size() - end.span()[1] + end.span()[0] print("value of end_pos variable in the code: ", end_pos)` – skp Oct 31 '21 at 11:56
  • Here are the output from the print statements in above code snippet. Based on my understanding of "span" method, I was expecting `tail_pattern_span` to show `(294, 300)`. But it is picking the pattern at `(244, 250)`. Could you please share your thoughts? `Total size in bytes of the file I am searching the patterns on: 300 head_pattern_span: tail_pattern_span: value of end_pos variable in the code: 294` – skp Oct 31 '21 at 12:00
  • @skp Yep, my bad. Look at the [edit #4](https://stackoverflow.com/posts/69635440/revisions) to see what was wrong. (Although you could have caught that yourself. You're supposed to scrutinize any code you get from the Internet, Stack Overflow is no exception.) – Tomalak Oct 31 '21 at 14:20
  • This works perfectly!! Thanks very much for all your patience and support!! – skp Oct 31 '21 at 18:40
  • @skp You're welcome. ;) If you feel like it you can try to gzip the whole thing, as I suggested, just to save some disk space. – Tomalak Oct 31 '21 at 19:06
  • Sure, I will look into gzipping it. Thank you! – skp Nov 02 '21 at 18:55
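
A minimal sketch of the gzip suggestion from the comments, assuming the same byte-oriented writing code as in the answer (result.json.gz is just an example filename):

import gzip

# gzip.open returns a binary file object, so every result.write(b'...')
# call from the answer works unchanged; only the open() call differs
with gzip.open('result.json.gz', 'wb', compresslevel=5) as result:
    result.write(b'{"JSON": "fragment", "array_element": [\n')
    # ... copy the per-file byte ranges here exactly as in the answer ...
    result.write(b']\n}')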