I am trying to load JSON files that are too big for json.load. I have spent a while looking into ijson and many Stack Overflow posts, and used the following code, mostly taken from https://stackoverflow.com/a/58148422/11357695 :
```python
def extract_json(filename):
    listJ = []
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'records.item', use_float=True)
        jsons = (o for o in jsonobj)
        for j in jsons:
            listJ.append(j)
    return listJ
```
My JSON file is read in as a dict with 6 keys, one of which is 'records'. The above function only replicates the contents of this 'records' key's value. I looked into this a bit more and came to the conclusion that ijson.items uses a prefix ('records.item'), so it's not surprising that it only replicates this key's value. But I'd like to get everything.
To achieve this, I looked at using ijson.parse to give a list of prefixes. When I fed all of the prefixes produced by the parser generator object below into ijson.items() in a loop, I got a MemoryError pretty quickly from the ijson.items() statement. I also got IncompleteJSONError in earlier iterations of the code, which does not appear with the current version. However, if I remove the except ijson.IncompleteJSONError statement I get a MemoryError:
```python
def loadBigJsonBAD(filename):
    with open(filename, 'rb') as input_file:
        parser = ijson.parse(input_file)
        prefixes = []
        for prefix, event, value in parser:
            prefixes.append(prefix)
    listJnew = []
    with open(filename, 'rb') as input_file:
        for prefix in prefixes:
            jsonobjn = ijson.items(input_file, prefix, use_float=True)
            try:
                jsonsn = (o for o in jsonobjn)
                for jn in jsonsn:
                    listJnew.append(jn)
            except ijson.IncompleteJSONError:
                continue
    return listJnew
```
I tried what would happen if I just searched for prefixes that don't start with 'records', to see if this would at least give me the rest of the dictionary. However, it actually worked perfectly and made a list whose first object is the same as the object generated by json.load (which worked in this case, as I was using a small file to test the code):
```python
def loadBigJson(filename):
    with open(filename, 'rb') as input_file:
        parser = ijson.parse(input_file)
        prefixes = []
        for prefix, event, value in parser:
            if prefix[0:len('records')] != 'records':
                prefixes.append(prefix)
    listJnew = []
    with open(filename, 'rb') as input_file:
        for prefix in prefixes:
            jsonobjn = ijson.items(input_file, prefix, use_float=True)
            try:
                jsonsn = (o for o in jsonobjn)
                for jn in jsonsn:
                    listJnew.append(jn)
            except ijson.IncompleteJSONError:
                continue
    return listJnew
```
When this is tested:

```python
path_json = r'C:\Users\u03132tk\.spyder-py3\antismashDB\GCF_010669165.1\GCF_010669165.1.json'
extractedJson = extract_json(path_json)       # extracts the 'records' key's value
loadedJson = json.load(open(path_json, 'r'))  # loads the entire JSON file
loadedJsonExtracted = loadedJson['records']   # the thing I am using to compare to the extractedJson item
bigJson = loadBigJson(path_json)              # a list whose single object is the same as the loaded JSON

print(bigJson[0] == loadedJson)                      # True
print(bigJson[0]['records'] == loadedJsonExtracted)  # True
print(bigJson[0]['records'] == extractedJson)        # True
```
This is great, but it highlights that I don't really understand what's going on. Why is the 'records' prefix necessary for the extract_json function (I tried the other keys in the JSON dictionary; there were no hits), but counterproductive for loadBigJson? What is generating the errors, and why does an except ijson.IncompleteJSONError statement prevent a MemoryError?
As you can tell, I'm pretty unfamiliar with working with JSON, so any general tips/clarifications would also be great.
Thanks for reading the novel, even if you don't have an answer!
Tim