
I am trying to read a large JSON file (~2 GB) in Python.

The following code works well on small files but fails on large ones with a MemoryError on the second line.

import json, sys

in_file = open(sys.argv[1], 'r')
posts = json.load(in_file)

I looked at similar posts and almost everyone suggested using ijson, so I decided to give it a try.

import ijson, sys

in_file = open(sys.argv[1], 'r')
posts = list(ijson.parse(in_file))

This handled reading the big file, but ijson.parse doesn't return a JSON object the way json.load does, so the rest of my code failed with

TypeError: tuple indices must be integers or slices, not str

If I print out "posts" when using json.load, the output looks like normal JSON

[{"Id": "23400089", "PostTypeId": "2", "ParentId": "23113726", "CreationDate": ... etc

If I print out "posts" after using ijson.parse, the output looks like a hash map

[["", "start_array", null], ["item", "start_map", null], 
 ["item", "map_key", "Id"], ["item.Id", "string ... etc

My question: I don't want to change the rest of my code, so I am wondering if there is any way to convert the output of ijson.parse(in_file) back to a JSON object so that it's exactly the same as if we were using json.load(in_file)?
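As a side note, the (prefix, event, value) tuples that ijson.parse yields can in principle be folded back into ordinary Python objects. The sketch below does that by hand on a tiny hardcoded event list mirroring the output shown above; with the real library you would feed it ijson.parse(in_file) instead, or simply use ijson.items, which performs the equivalent reconstruction for you.

```python
def build(events):
    """Rebuild a Python object from ijson-style (prefix, event, value) events."""
    stack = []   # containers currently being filled
    root = None
    key = None   # most recent map_key, used when inserting into a dict

    def insert(value):
        nonlocal root
        if not stack:
            root = value
        elif isinstance(stack[-1], list):
            stack[-1].append(value)
        else:
            stack[-1][key] = value

    for prefix, event, value in events:
        if event == 'start_array':
            arr = []
            insert(arr)
            stack.append(arr)
        elif event == 'start_map':
            obj = {}
            insert(obj)
            stack.append(obj)
        elif event in ('end_array', 'end_map'):
            stack.pop()
        elif event == 'map_key':
            key = value
        else:  # string, number, boolean, null
            insert(value)
    return root

# Hardcoded events for illustration; with ijson installed,
# build(ijson.parse(in_file)) would work the same way.
events = [
    ('', 'start_array', None),
    ('item', 'start_map', None),
    ('item', 'map_key', 'Id'),
    ('item.Id', 'string', '23400089'),
    ('item', 'end_map', None),
    ('', 'end_array', None),
]
print(build(events))   # -> [{'Id': '23400089'}]
```

Note that building the full structure this way still needs enough memory to hold it, so it does not by itself solve the MemoryError.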

  • No. The amount of RAM is your problem - if you cannot load the structure into memory at once, you cannot have a similar interface, even if you used ijson. – Antti Haapala -- Слава Україні Jan 01 '17 at 23:49
  • Would it be possible to read the file line by line into an object to work around the RAM issue? – Course Attendance Jan 02 '17 at 04:26
  • Is this python 3 or python 2, out of curiosity? I'm curious if 3 might have some optimizations in the json encoder or the interpreter's memory management. – Chris Larson Jan 02 '17 at 04:57
  • It's python 3. I've been trying for a couple of days now to solve this issue. I tried almost all the solutions posted on different forums but still can't get the output of ijson.parse(in_file) to convert to the same structure as json.load(in_file). – Course Attendance Jan 02 '17 at 05:15
  • Have you experimented at all with `io.StringIO`, or any of the `io` streaming interfaces? Particularly the Buffered Streams stuff in https://docs.python.org/3/library/io.html? The json decoder docs at https://docs.python.org/3/library/json.html show an example in the first examples of doing so. Using the lines: `from io import StringIO` , `io = StringIO()` , `json.dump(['streaming API'], io)` , `io.getvalue()`. It looks like you can set the buffer size for reading, which might let you get around the filesize issue. – Chris Larson Jan 02 '17 at 05:27
  • I mention this because I get the impression you have no problem digging in and figuring out stuff. This might be worth researching. – Chris Larson Jan 02 '17 at 05:29
  • Thanks for the suggestion, Chris. I tried json.dump and the output has the same hash interface with extra double quotes and slashes "[[\"\", \"start_array\", null], [\"item\", \"start_map\", null], [\"item\", \"map_key\", \"Id\"] ... – Course Attendance Jan 02 '17 at 19:33
  • You bet. Can you post an MCVE including the contents of a small sample data file? I have a thought I'd like to try, but I'd prefer to try it using your code in complete but minimal form. (If the term `MCVE` is unfamiliar, see: http://stackoverflow.com/help/mcve) – Chris Larson Jan 02 '17 at 22:02
  • Also, it's weird that your output has the form of `[["", "start_array", null], ["item", "start_map", null], ["item", "map_key", "Id"]]` In my attempts, I get a list of tuples, not lists. (Also, for what it's worth, don't think of this as a hash map. I'm guessing you have a background in Java. In python, this is a list of lists. The `hash map` terminology will limit your searches of python discussions. If you already knew this, my apologies if that sounded condescending.) – Chris Larson Jan 02 '17 at 22:12
  • Using the two lines you posted, my output looks like: `[('', 'start_array', None), ('item', 'start_map', None), ('item', 'map_key', 'Id'), ... ]` Note the `tuples`, the conversion to single-quotes, and the conversion of `null` to `None`. Python will balk at the bare `null`, considering it an undefined variable name, if you try to use this data outside of ijson methods. I'm not at all sure why your output looks like that as opposed to mine. I'm taking in a pure json file and assume you are as well. – Chris Larson Jan 02 '17 at 22:22
  • MCVE of mine is composed of the lines: `import ijson` , `filename = ''` , `with open(filename, 'r') as json_data_file:` , ` posts = list(ijson.parse(json_data_file))` , ` print(posts)`. I doubt the `with open(` thing would make any difference here. All that does is assign the variable name `json_data_file` to the file object, and then ensure the file is closed after assigning the data to `posts`, and that last bit is the same thing your code is doing. It's a puzzle to me why the tuple-quote-None thing is happening on my end and not yours. – Chris Larson Jan 02 '17 at 22:32
  • I finally found out the problem. It's not a RAM issue and I didn't need to use any extra libraries. It turns out that I was using 32-bit Python on a 64-bit machine. Once I installed 64-bit Python, json.load works just fine and I was able to load the big JSON file with no issues. Thanks a lot Chris for your help. – Course Attendance Jan 03 '17 at 01:33
  • Ha! Excellent. It's surprising sometimes how much time I spend trying to solve a problem on my machine that turns out not to exist. Glad you got it worked out. And you're welcome. – Chris Larson Jan 03 '17 at 18:41
  • On a side note, I followed up with the author of bigjson regarding that bug. Turns out he just put that title in there thinking no one else would be looking at his library. All it actually refers to is that there is an upper memory limit on the size of strings used as values and keys, though none on arrays and files. He's retitled the issue appropriately. – Chris Larson Jan 03 '17 at 18:44
  • You can use *ijson.items* instead of *ijson.parse*. The official example on GitHub is very clear. [Here is the link where you can get it from the README](https://github.com/isagalaev/ijson) – foli Jun 30 '17 at 01:46
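A footnote on the 32-bit-versus-64-bit resolution above: whether a Python build is 32- or 64-bit is easy to check from inside the interpreter, and a 32-bit build caps the process at a few GB of address space no matter how much RAM the machine has.

```python
import struct
import sys

# On a 64-bit build, pointers are 8 bytes and sys.maxsize is 2**63 - 1;
# on a 32-bit build they are 4 bytes and sys.maxsize is 2**31 - 1, which
# is why a ~2 GB json.load can fail there even on a 64-bit machine.
bits = struct.calcsize('P') * 8
is_64bit = bits == 64

print('build:', bits, 'bit; sys.maxsize =', sys.maxsize)
```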

1 Answer


Maybe this works for you:

import ijson, sys

in_file = open(sys.argv[1], 'r')
posts = []
# 'item' matches each element of the top-level JSON array; ijson.items
# yields them one at a time as plain dicts, unlike ijson.parse, which
# yields low-level (prefix, event, value) tuples.
for post in ijson.items(in_file, 'item'):
    posts.append(post)
– flashback
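One more option, following up on the read-line-by-line idea from the comments: if the data can be re-exported as JSON Lines (one complete object per line) rather than one giant array, the standard library alone can stream it with flat memory use, since each line parses independently. A minimal sketch using an in-memory sample; the `posts.jsonl` filename mentioned in the comment is hypothetical.

```python
import io
import json

# JSON Lines: one complete JSON object per line, so each line parses
# on its own and the whole file never has to sit in memory at once.
sample = io.StringIO(
    '{"Id": "23400089", "PostTypeId": "2"}\n'
    '{"Id": "23400090", "PostTypeId": "1"}\n'
)

posts = []
for line in sample:   # with a real file: for line in open('posts.jsonl')
    posts.append(json.loads(line))

print(posts[0]['Id'])   # -> 23400089
```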