I am working on a process that will likely end up attempting to serialize very large JSON arrays to file, so loading the entire array into memory and just dumping it to file won't work. I need to stream the individual items to file to avoid out-of-memory issues.

Surprisingly, I can't find any examples of doing this. The code snippet below is something I've cobbled together. Is there a better way to do this?

import json

first_item = True
with open('big_json_array.json', 'w') as out:
    out.write('[')
    for item in some_very_big_iterator:
        if first_item:
            out.write(json.dumps(item))
            first_item = False
        else:
            out.write("," + json.dumps(item))
    out.write("]")
And presumably you will need to load the data from the file at a later date, so you will hit the same issues. I guess this is why the newline JSON format was created – roganjosh Sep 02 '18 at 13:48

1 Answer

While your code is reasonable, it can be improved upon. You have two workable options, plus an additional suggestion.

Your options are:

  • Don't generate an array; generate JSON Lines output instead. For each item in your generator, this writes a single valid JSON document, without newline characters, into the file, followed by a newline. This is easy to produce with the default json.dump(item, out) configuration followed by out.write('\n'). You end up with a file with a separate JSON document on each line; a minimal sketch follows this list.

    The advantage is that you don't have to worry about memory issues when writing or when reading the data back. Otherwise you'd face bigger problems when loading the file later on, because the json module can't be made to load data iteratively, not without manually skipping the initial [ and the commas.

    Reading JSON Lines is simple; see Loading and parsing a JSON file with multiple JSON objects in Python

  • Wrap your data in a generator and a list subclass to make json.dump() accept it as a list, then write the data iteratively. I'll outline this below. Bear in mind that you may then have to solve the problem in reverse: reading the JSON data back with Python iteratively (a sketch of one option closes this answer).
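
As an illustration of the first option, here is a minimal JSON Lines sketch; generate_items() is a hypothetical stand-in for your iterator:

import json

def generate_items():
    # hypothetical stand-in for some_very_big_iterator
    for i in range(1000):
        yield {'id': i, 'value': i * 2}

# Writing: one complete JSON document per line.
with open('big_data.jsonl', 'w') as out:
    for item in generate_items():
        json.dump(item, out)
        out.write('\n')

# Reading back: parse one line at a time; the whole file never sits in memory.
with open('big_data.jsonl') as f:
    for line in f:
        item = json.loads(line)
        ...  # process item here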

My suggestion is to not use JSON here. JSON was never designed for large-scale data sets; it's a web interchange format aimed at much smaller payloads. There are better formats for this sort of data exchange. If you can't deviate from JSON, then at least use JSON Lines.

You can write a JSON array iteratively using a generator, and a list subclass to fool the json library into accepting it as a list:

import json

class IteratorAsList(list):
    def __init__(self, it):
        self.it = it
    def __iter__(self):
        # hand the encoder our iterator instead of real list contents
        return self.it
    def __len__(self):
        # report a non-zero length so the encoder doesn't short-circuit to '[]'
        return 1

with open('big_json_array.json', 'w') as out:
    json.dump(IteratorAsList(some_very_big_iterator), out)

The IteratorAsList class satisfies the two tests the json encoder makes: that the object is a list or a subclass thereof, and that it has a length greater than zero. When both conditions are met, the encoder iterates over the list (via __iter__) and encodes each object.

The json.dump() function then writes to the file as the encoder yields data chunks; it'll never hold all the output in memory.
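
To read such an array back without loading it all at once, the standard library offers nothing iterative, but a streaming parser such as the third-party ijson package can walk the array item by item. A minimal sketch, assuming ijson is installed and the file was written as above:

import ijson

with open('big_json_array.json', 'rb') as f:
    # 'item' is ijson's prefix for each element of a top-level array
    for item in ijson.items(f, 'item'):
        ...  # process each item without holding the whole array in memory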

  • _There are better formats for this sort of data exchange._ Would you care to point out some better alternatives? – Adrian Martin Sep 10 '18 at 15:40
  • @AdrianMartin: I can't give any sensible recommendations without more details on what the use cases are. – Martijn Pieters Sep 10 '18 at 15:48
  • Thanks @MartijnPieters. I was just curious. I'll ask a proper question, then. – Adrian Martin Sep 10 '18 at 20:06
  • Hey, bit of an old answer, but I have a question regarding this. I was looking for something like this: saving to file without holding it in memory. I tried to implement it but I'm not sure if I'm using it right. Here's my code, https://hasteb.in/uyojamam.kotlin, but this still saves into memory, right? Ideally I'd like to save program_json and just overwrite it next time, without having a huge list. – Simeon Aleksov Dec 01 '20 at 13:42
  • @SimeonAleksov: sorry, that's not something I have time to help with directly. You could try posting a question about it here on Stack Overflow? – Martijn Pieters Dec 01 '20 at 17:29