
I have some large json encoded files. The smallest is 300MB; the rest are multiple GB, anywhere from around 2GB to 10GB+.

I seem to run out of memory when trying to load the files in Python.

I tried using this code to test performance:

```python
from datetime import datetime
import json

print datetime.now()

f = open('file.json', 'r')
json.load(f)
f.close()

print datetime.now()
```

Not too surprisingly, this causes a MemoryError. It appears that `json.load()` calls `json.loads(f.read())`, so it tries to read the entire file into memory first, which clearly isn't going to work.

How can I solve this cleanly?


I know this is old, but I don't think this is a duplicate. While the answer is the same, the question is different. In the "duplicate", the question is how to read large files efficiently, whereas this question deals with files that won't even fit into memory at all. Efficiency isn't required.

  • Similar if not the same question: http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python – tskuzzy Apr 30 '12 at 10:39
  • The issue is that if the JSON file is one giant list (for example), then parsing it into Python wouldn't make much sense without doing it all at once. I guess your best bet is to find a module that handles JSON like SAX and gives you events for starting arrays and stuff, rather than giving you objects. Unfortunately, that doesn't exist in the standard library. – Gareth Latty Apr 30 '12 at 10:40
  • Well, I kind of want to read it in all at once. One of my potential plans is to go through it once and stick everything in a database so I can access it more efficiently (a minimal sketch of this idea follows these comments). – Tom Carrick Apr 30 '12 at 10:45
  • If you can't fit the entire file as text into memory, I sincerely doubt you'll fit the entire file as Python objects into memory. If you want to put it in a database, my answer could be helpful. – Gareth Latty Apr 30 '12 at 10:46
  • For any non-trivial task, processing JSON files of such sizes can easily take weeks or months. – yazu Apr 30 '12 at 10:50
  • I have been going through and fixing old duplicate closures to use the new system. I agree with the closure here: while the stated requirements are somewhat different, it's clear from reading both Q&As that the cause and solution are the same. – Karl Knechtel Jan 14 '23 at 10:10
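
A minimal sketch of that "stick everything in a database" idea from the comments, assuming the file is one big top-level JSON array and using the ijson module recommended in the answer below; the database file, table, and column names here are hypothetical:

```python
import json
import sqlite3

import ijson  # streaming JSON parser, recommended in the answer below

# Hypothetical schema; adjust the table and columns to fit your data.
conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (payload TEXT)')

with open('file.json', 'rb') as f:
    # 'item' is ijson's prefix for the elements of a top-level array, so
    # only one element is held in memory at a time.
    for record in ijson.items(f, 'item'):
        # default=str handles the Decimal values ijson yields for numbers.
        conn.execute('INSERT INTO records (payload) VALUES (?)',
                     (json.dumps(record, default=str),))

conn.commit()
conn.close()
```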

1 Answer


The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.

The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.

The best option appears to be using something like ijson - a module that will work with JSON as a stream, rather than loading the whole file as one block.
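
For example, assuming the file holds one large top-level array, a minimal ijson sketch looks something like this (the filename and the process() callback are placeholders):

```python
import ijson

def process(record):
    # Placeholder: handle one element at a time.
    print(record)

with open('file.json', 'rb') as f:
    # ijson.items() walks the file incrementally; the 'item' prefix selects
    # each element of the top-level array, so only one element is in memory
    # at any point.
    for record in ijson.items(f, 'item'):
        process(record)
```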

Edit: Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.
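
Based on the comments below, a bigjson sketch might look like this; load() returns a lazy array/dict-like object that only reads what you access (the index and field name here are just illustrative):

```python
import bigjson  # see Henrik Heino's comment below

with open('file.json', 'rb') as f:
    data = bigjson.load(f)   # lazy array/dict-like wrapper, per the comments
    element = data[4]        # only this element is actually read from disk
    print(element['name'])   # 'name' is a made-up field; use your own keys
```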

Gareth Latty
  • Thanks! I'll check that out in hopes that I don't have to resort to using Java. – Tom Carrick Apr 30 '12 at 10:51
  • Yeah I think I'll end up using Java but this is the answer to the question I actually asked. Thanks again. – Tom Carrick Apr 30 '12 at 13:01
  • I found that ijson requires complete JSON before it will stream it - I would have preferred something that can work with partial JSON as it becomes available. Couldn't find anything, so I wrote my own; it's called [jsonstreamer](https://github.com/kashifrazzaqui/json-streamer) and is available on [GitHub](https://github.com/kashifrazzaqui/json-streamer) and from the cheeseshop (PyPI). – keios Dec 19 '14 at 11:38
  • On the github page I don't see anything about using an external file. Would it be possible to "open" a file and load a json file with this? – Jeremy Feb 06 '15 at 04:51
  • @JeremyCraigMartinez It looks like you just need to do a `with open(some_file) as file: for line in file: streamer.consume(line)`. It'd be nice to have a convenience method (or better yet, a context manager) for this - and note that for the use case at hand, relying on line breaks being reasonable is probably a bad idea; reading the file in blocks by size is probably the better option (a sketch of this appears at the end of these comments). – Gareth Latty Feb 06 '15 at 12:46
  • @JeremyCraigMartinez As mentioned by Lattyware, it should be fairly simple to do this. I can add context management support and will probably do so in the next release. If you require any other features or spot bugs, you are more likely to catch my attention by making a GitHub issue. – keios Feb 12 '15 at 17:34
  • I also wrote a lib that can open JSON files of any size. My lib loads an object that acts like a regular dict or array, but in reality it loads more stuff only when required. You can find it on [Github](https://github.com/henu/bigjson). – Henrik Heino Aug 06 '16 at 13:43
  • @HenrikHeino bigjson looks like the best option, but it's unusable in Python 3. – orluke Jul 02 '19 at 19:54
  • @orluke It should now work on Python 3 :) – Henrik Heino Jul 04 '19 at 04:22
  • @HenrikHeino: Can I create custom encoders and decoders in your library, like the `cls=` argument does for the ordinary json package? I need it for writing and reading numpy arrays and other custom data types. Thx! – gilgamash Oct 28 '20 at 09:15
  • @HenrikHeino hi there, I tried your lib bigjson; running `poetry run python import_pdl.py` fails at line 15, `element = j[4]`, with `TypeError: Key must be string!` – dmh Aug 12 '21 at 05:54
  • @HenrikHeino; thanks for building the `bigjson` lib. My json file includes multiple objects, and `json.load` or `bigjson.load` seem to work for json files containing a single object. For json files with multiple objects we can use `json.loads`, but the `module bigjson has no attribute loads`. Could you please help me with this? – mOna Aug 16 '21 at 10:43
  • @HenrikHeino, I am also in need of a loads method. Also, bigjson works on a json file over 1GB, however the returned objects are all bigjson types and not plain dicts etc. – MMEL May 25 '22 at 07:22
  • @HenrikHeino where can I find complete and very detailed examples using your library? I could only find one simple example on your GitHub page. What I would like to do is to convert a bigjson Object into a dict - is that even possible? When I first load the file, the result is a bigjson Array, but how do I deal with the objects in that array? What methods can we use to extract data from these objects, and can we convert them to dicts? Thank you. – MMEL May 26 '22 at 05:32
  • @HenrikHeino do you know a way to deal with huge single-line JSON files? – Chetan_Vasudevan Oct 07 '22 at 20:23
  • You can try JSONBuddy https://www.json-buddy.com if a Windows desktop and command-line tool is ok for you. – Clemens Dec 13 '22 at 10:05
  • zq is a tool that can manipulate large json objects no matter the size. You can even manipulate the json. Might not even need python. https://zed.brimdata.io/docs/install `zq -i json 'count()' huge.json` – James Kerr Feb 27 '23 at 22:19
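
A rough sketch of the "read the file in blocks by size" pattern Gareth Latty suggests above, feeding each chunk to jsonstreamer's consume(); the listener setup and close() call are assumptions based on the project's README, so verify them before relying on this:

```python
from jsonstreamer import JSONStreamer  # pip install jsonstreamer

def on_event(event_name, *args):
    # Assumption: jsonstreamer emits SAX-style events (object/array start,
    # keys, values) to a catch-all listener; check the project's README.
    print(event_name, args)

streamer = JSONStreamer()
streamer.add_catch_all_listener(on_event)  # assumed listener API

# Read fixed-size chunks rather than lines, so a file with no line breaks
# still streams in bounded memory.
with open('file.json', 'r') as f:
    while True:
        chunk = f.read(64 * 1024)
        if not chunk:
            break
        streamer.consume(chunk)  # consume() is the call shown in the comments above

streamer.close()  # assumed; signals end of input
```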