I want to (pre)process large JSON files (5-10GB each) that contain multiple root elements. These root elements follow each other without a separator, like this: {}{}....

So I first wrote the following simple code to get a valid JSON file:

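import io
import pandas as pd

with open(file) as f:
    file_data = f.read()  # loads the whole file into memory

# Naive fix-up: turn "{}{}..." into "[{},{},...]"
file_data = file_data.replace("}{", "},{")
file_data = "[" + file_data + "]"
df = pd.read_json(io.StringIO(file_data))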

Obviously this doesn't work with large files. Even the 400MB file fails (I have 16GB of memory).

I've read that it's possible to work with chunks, but I can't work out how to fit this into "chunk logic". Is there a way to "chunkenize" this?

I would be glad for your help.

johnhenry
  • Why are there multiple root elements in the first place? Why not put the JSON array into the file? – Barmar Sep 27 '19 at 17:27
  • This is coming from the data source, I have no influence on that. – johnhenry Sep 27 '19 at 17:32
  • You should tell them that this is unprocessable, it doesn't have a reliable way to delimit the data. Your method won't work if there are any strings in the JSON data that contain `}{`. – Barmar Sep 27 '19 at 17:34
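To illustrate the point about `}{` inside strings, here is a minimal sketch. The sample string `raw` is made up for the example, not taken from the real data. It shows the textual replace silently altering a string value, while `json.JSONDecoder.raw_decode` from the standard library parses one root value at a time and reports where it stopped, so it does not depend on a delimiter.

```python
import json

# Hypothetical sample: two concatenated root objects.
# The first one contains "}{" inside a string value.
raw = '{"note": "a}{b"}{"note": "ok"}'

# The naive textual fix also rewrites the "}{" inside the string.
patched = "[" + raw.replace("}{", "},{") + "]"
naive = json.loads(patched)

# raw_decode parses a single JSON value starting at `pos` and
# returns the value together with the index where it ended.
decoder = json.JSONDecoder()
pos = 0
objects = []
while pos < len(raw):
    obj, pos = decoder.raw_decode(raw, pos)
    objects.append(obj)

print(naive[0]["note"])    # string value changed by the replace
print(objects[0]["note"])  # original string value preserved
```

Note that `raw_decode` does not skip leading whitespace, so this loop only works as written when the objects follow each other with nothing in between.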

1 Answer

I am having a hard time visualizing the multiple-root-element idea, but you should write the file_data contents to disk and try reading them in separately. While the file is open, it consumes RAM in addition to the RAM consumed by the file_data object (and possibly even the modified object, though that's a garbage collector question; I think garbage collection gets done after the function returns). Try calling f.close() explicitly instead of using the with block, and return the result from a separate function.

  • How am I supposed to replace the 'with' with 'f.close'? – johnhenry Sep 28 '19 at 00:11
  • I found this question: https://stackoverflow.com/questions/27907633/multiple-json-objects-in-one-file-extract-by-python. Look at the last answer; it looks very promising. But I can't apply it to a file that comes from disk. I get an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)... Any suggestions?
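
One way the `raw_decode` idea could be applied to a file on disk is sketched below. It is a rough sketch under assumptions, not a tested solution for the 5-10GB files. The function name `iter_json_objects` and the `chunk_size` default are made up for illustration, the root elements are assumed to be objects or arrays, and the file is assumed to be UTF-8. I can't tell what causes the error quoted above, but `raw_decode` raises exactly that message when the text it is given starts with whitespace or a byte order mark, or when it is handed the file name instead of the file contents, so the sketch skips whitespace and opens the file with `utf-8-sig`.

```python
import json


def iter_json_objects(path, chunk_size=1024 * 1024):
    """Yield each root-level JSON value from a file of concatenated values."""
    decoder = json.JSONDecoder()
    buffer = ""
    # Assumption: UTF-8 text; "utf-8-sig" also tolerates an optional BOM.
    with open(path, encoding="utf-8-sig") as f:
        while True:
            chunk = f.read(chunk_size)
            buffer += chunk
            pos = 0
            while True:
                # raw_decode does not skip leading whitespace, so do it here.
                while pos < len(buffer) and buffer[pos].isspace():
                    pos += 1
                if pos == len(buffer):
                    break
                try:
                    obj, pos = decoder.raw_decode(buffer, pos)
                except json.JSONDecodeError:
                    if not chunk:
                        raise  # end of file and the rest is not valid JSON
                    break  # value is probably cut off: read another chunk
                yield obj
            # Keep only the unparsed tail for the next round.
            buffer = buffer[pos:]
            if not chunk:
                break
```

Because this is a generator, only the current chunk plus any unfinished object is held in memory. The objects could then be collected in batches of a few thousand and each batch passed to `pd.DataFrame(...)`, instead of building one huge string first. A single object much larger than `chunk_size` gets re-parsed on every read, so the chunk size should comfortably exceed the largest expected object.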