
I would like to load the items of my json file one by one. The file could be up to 3GB, so loading it in advance and looping over it is not an option.

My json file is basically a dictionary of key/value pairs (hundreds of pairs), and there is nothing I want to discard (which is why ijson's filtering approach didn't seem like a fit at first).

I just want to load one pair at a time to work with it. Is there any way to do that?

BlueMountain

3 Answers


So basically I found out in this answer how to do it in a much simpler way: https://stackoverflow.com/a/17326199/2933485

Using ijson, it looks like you can loop over the file without loading it all into memory, by opening the file and running ijson's parse function over it. This is the example I found:

import ijson

with open(json_file_name, 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)
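
On top of that, newer ijson versions also have a kvitems function that yields an object's key/value pairs directly, which maps nicely onto this use case. A minimal sketch, assuming the top level of the file is a single JSON object (the filename big.json is just a placeholder):

import ijson

with open('big.json', 'rb') as f:
    # '' is the prefix of the top-level object, so this yields one
    # (key, value) pair at a time without reading the whole file into memory
    for key, value in ijson.kvitems(f, ''):
        print(key, value)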
BlueMountain

OK, so json is a nested format, which means each repeating block (dict or list object) is surrounded by start and end characters. Normally you read the entire file, and in doing so can confirm the well-formedness, structure and "closedness" of each object - in other words, it's verifiable that all objects are legally structured. When you load a json file into memory using the json library, part of that process is the validation.

If you want to do that for an extra-large file, you have to forgo the normal library and roll your own, loading a line (or chunk) at a time and processing it under the assumption that validation will retrospectively succeed.

That's achievable (assuming you're able to put your faith in such an assumption), but it's probably something you'll have to write yourself.

One strategy might be to read a line at a time, splitting on the colon : character, with commas as record delimiters - a crude approximation of how key-value pairs are coded within json. Following this method, you'll be able to process all but the first and final key-value pairs cleanly and in sequence.

That just leaves you to write some special conditions for properly parsing the first and final records, which will come through garbled using this strategy.

Crudely then, you could call something like this (using the csv library) and treat the json like a massive, unusually formatted csv file.

import csv

with open('big.json', newline='') as csv_json_franken_file:
    # newline='' is the value the csv module expects (open() rejects ','),
    # so each physical line of the file comes through as one "row", split on
    # colons; quotechar='"' keeps colons inside quoted strings intact
    jsonreader = csv.reader(csv_json_franken_file, delimiter=':', quotechar='"')
    for row in jsonreader: # This bit reads in a "row" at a time, until finished
        print(', '.join(row))

Then do some edge-case treatment of the first and last rows (more or less depending on the structure of your json) to repair the garbling caused by what is a fairly blatant hack. It's not clean, and it's not robust to changes in the content - but sometimes, you just have to play the hand you've been dealt.
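
As a rough illustration of that edge-case treatment (the helper name, and the assumption that it's the object's opening "{" and closing "}" that garble the first and last rows, are mine, not part of the answer):

def clean_edge_row(row):
    # Hypothetical helper: strip the stray braces that end up glued to the
    # first and last rows when the object's "{" and "}" pass through the reader
    return [cell.strip().strip('{}').strip() for cell in row]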

To be honest, generating json files of 3GB in size is a little irresponsible, so if anyone comes asking, you've got that in your corner.

Thomas Kimber

Why don't you populate a sqlite table with the data once and then query the data using the record's PK? See https://docs.python.org/3.7/library/sqlite3.html
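
A minimal sketch of that idea, assuming the file's top level is a single object of key/value pairs and streaming the pairs in with ijson (as in the accepted answer) so the 3GB file never has to be loaded at once - the database filename and the kv/key/value table and column names here are made up for illustration:

import json
import sqlite3
import ijson

conn = sqlite3.connect('big.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

# One-off import: stream the pairs from the JSON file into the table
with open('big.json', 'rb') as f:
    conn.executemany(
        'INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
        ((k, json.dumps(v)) for k, v in ijson.kvitems(f, '')),
    )
conn.commit()

# Afterwards, any single pair can be looked up by its key (the PK) without
# touching the rest of the data
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('some_key',)).fetchone()

Nested values are stored as JSON strings here, which is the caveat the comment below points out.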

balderman
  • If some records are nested in the JSON data, this could be a problem, as you could then get a List cast to a string in one of the SQL fields, for example. If you want to work with the raw data directly, this may not be a good idea. – Will Croxford Feb 28 '23 at 13:09