
I have been trying to write a large amount (>800 MB) of data to a JSON file; after a fair amount of trial and error, I ended up with this code:

import json

def write_to_cube(data):
    # Load the entire existing file into memory
    with open('test.json') as file1:
        temp_data = json.load(file1)

    # Merge the new data into the loaded dict
    temp_data.update(data)

    # Write the whole structure back out
    with open('test.json', 'w') as f:
        json.dump(temp_data, f)

To run it, just call the function: `write_to_cube({"some_data": data})`

Now the problem with this code is that it is fast for small amounts of data, but it slows down badly once test.json holds more than 800 MB. When I try to update or add data at that point, it takes ages.

I know there are external libraries such as simplejson or jsonpickle, but I am not sure how to use them.
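
From what I can tell, they would slot into the code above roughly like this (untested; both installed from PyPI):

# Rough, untested sketch: simplejson mirrors the standard json API,
# while jsonpickle works on strings (pip install simplejson jsonpickle).

# Option A: simplejson as a drop-in replacement -- nothing else changes.
import simplejson as json

# Option B: jsonpickle, which encodes/decodes via strings.
import jsonpickle

def write_with_jsonpickle(data):
    with open('test.json') as f:
        temp_data = jsonpickle.decode(f.read())

    temp_data.update(data)

    with open('test.json', 'w') as f:
        f.write(jsonpickle.encode(temp_data))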

Is there any other way to approach this problem?

Update:

I am not sure how this can be a duplicate; the other questions say nothing about writing to or updating a large JSON file, only about parsing one.

Is there a memory efficient and fast way to load big json files in python?

Reading rather large json files in Python

Neither of the above makes this question a duplicate. They don't say anything about writing or updating.

Akshay
  • Possible duplicate of [is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python](http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python) – BPL Sep 06 '16 at 00:36
  • Basically, JSON is not a good choice of format when it comes to serializing large amounts of data. – Klaus D. Sep 06 '16 at 00:39
  • Possible duplicate of [Reading rather large json files in Python](http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python) – Sreejith Menon Sep 06 '16 at 00:53
  • @BPL But none of them say anything about writing large data or updating it. – Akshay Sep 06 '16 at 02:39
  • Your question has been asked and answered in these threads: http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python and http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python – Sreejith Menon Sep 06 '16 at 00:52
  • @SreejithMenon But those are only about parsing, not about writing and updating. How can it be a duplicate? – Akshay Sep 06 '16 at 02:43
  • @KlausD. What do you recommend? – Akshay Sep 06 '16 at 02:50
  • Are you sure the time is being taken in the writing, rather than the reading? How have you proved that to yourself? (See the timing sketch after these comments.) – GreenAsJade Sep 06 '16 at 03:14
  • @GreenAsJade I have been testing this for over a month now; whenever there is a small amount of data, this procedure works fine. But when you read a large file, try to update it, and then write it back, it slows down quite a lot. – Akshay Sep 06 '16 at 03:20
  • Sure, but the code you listed above could be slow in reading, or in writing, or in both. Which is it? – GreenAsJade Sep 06 '16 at 03:23
  • @GreenAsJade Both. If the JSON file has large data. – Akshay Sep 06 '16 at 03:24
  • There are other JSON encoders on https://pypi.python.org like ultrajson or ujson that tend to be faster than the default implementation. Try them out. They tend to serialize/deserialize to memory and then read/write the file, so the memory overhead is 800 MB plus well over a gig for the in-memory object (its size depends on what the data is... it could easily be multiple gigs). The thing is, a JSON file of that size is going to be slow unless you have a machine with significant RAM and fast storage. You should rethink how this data is being stored. A JSON file for this data seems like a really bad idea. – tdelaney Sep 06 '16 at 03:33
  • To try using simplejson, install it from PyPI, then `import simplejson as json` instead of `import json`. – Quan To Sep 06 '16 at 04:30
  • Usually, for long-running operations, I'll just go with threads and callbacks. – Quan To Sep 06 '16 at 04:33
  • @tdelaney that's a good option. – Akshay Sep 06 '16 at 04:43
  • @btquanto I'll give it a try. – Akshay Sep 06 '16 at 04:43
  • The reason I marked it as a duplicate is that, to update the JSON file, you have to first read the data, then update the JSON object, and then write it. For both reading and writing the data, the links in my comment should be able to help. – Sreejith Menon Sep 06 '16 at 15:25
  • @SreejithMenon No, they don't. If they did, I wouldn't have even asked this question. It just doesn't make sense. And reading data is only a PART of it. – Akshay Sep 06 '16 at 23:15
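
A minimal way to answer GreenAsJade's question above is to time the three phases separately. This is an untested sketch using only the standard library; the file name and example payload are placeholders:

import json
import time

def timed_update(path, new_data):
    t0 = time.perf_counter()
    with open(path) as f:
        temp_data = json.load(f)           # read + parse
    t1 = time.perf_counter()

    temp_data.update(new_data)             # in-memory update

    t2 = time.perf_counter()
    with open(path, 'w') as f:
        json.dump(temp_data, f)            # serialize + write
    t3 = time.perf_counter()

    print(f"read: {t1 - t0:.1f}s  update: {t2 - t1:.1f}s  write: {t3 - t2:.1f}s")

# e.g. timed_update('test.json', {"some_data": {"foo": "bar"}})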

2 Answers


I found the json-stream package, which might be able to help. While it provides the mechanics for stepping over the input JSON and streaming Python data structures to an output JSON file, without concrete details from the OP it's hard to say whether it would have met their needs.

Just to see if it actually has any memory advantage in processing large files, I've mocked up this basic JSON:

{
     "0": {"foo": "bar"},
     "1": {"foo": "bar"},
     "2": {"foo": "bar"},
     "3": {"foo": "bar"},
     ...

up to 10M objects:

    ...
    "9999997": {"foo": "bar"},
    "9999998": {"foo": "bar"},
    "9999999": {"foo": "bar"},
}

and I've made up the requirement to change every odd object to {"foo": "BAR"}:

{
    "0": {"foo": "bar"},
    "1": {"foo": "BAR"},
    "2": {"foo": "bar"},
    "3": {"foo": "BAR"},
    ...
    "9999997": {"foo": "BAR"},
    "9999998": {"foo": "bar"},
    "9999999": {"foo": "BAR"},
}

I'm certain this is more trivial than what OP needed to do by passing an update dict (which I imagine to have a moderately "deep" structure).

I've written scripts to handle the generation, reading, and transforming of some test articles:

  • generate:

    @streamable_dict
    def yield_obj(n: int):
        for x in range(n):
            yield str(x), {"foo": "bar"}
    
    
    def gen_standard(n: int):
        with open(f"gen/{n}.json", "w") as f:
            obj = dict(list(yield_obj(n)))
            json.dump(obj, f, indent=1)
    
    
    def gen_stream(n: int):
        with open(f"gen/{n}.json", "w") as f:
            json.dump(yield_obj(n), f, indent=1)
    

    yield_obj() is a generator whose output can either be materialized with dict(list(...)), or be streamed straight to the standard json.dump() method with the help of the @streamable_dict wrapper.

    Makes three test files:

    -rw-r--r--   1 zyoung  staff   2.9M Feb 23 17:24 100000.json
    -rw-r--r--   1 zyoung  staff    30M Feb 23 17:24 1000000.json
    -rw-r--r--   1 zyoung  staff   314M Feb 23 17:24 10000000.json
    
  • read, which just loads and passes over everything:

    def read_standard(fname: str):
        with open(fname) as f:
            for _ in json.load(f):
                pass
    
    
    def read_stream(fname: str):
        with open(fname) as f:
            for _ in json_stream.load(f):
                pass
    
  • transform, which applies my silly "uppercase every odd BAR":

    def transform_standard(fname: str):
        with open(fname) as f_in:
            data = json.load(f_in)
            for key, value in data.items():
                if int(key) % 2 == 1:
                    value["foo"] = "BAR"
            with open(out_name(fname), "w") as f_out:
                json.dump(data, f_out, indent=1)
    
    
    def transform_stream(fname: str):
        @streamable_dict
        def update(data):
            for key, value in data.items():
                value = json_stream.to_standard_types(value)
                if int(key) % 2 == 1:
                    value["foo"] = "BAR"
                yield key, value
    
        with open(fname) as f_in:
            data = json_stream.load(f_in)
            updated_data = update(data)
            with open(out_name(fname), "w") as f_out:
                json.dump(updated_data, f_out, indent=1)
    

    @streamable_dict is used again to turn the update() iterator into a streamable "thing" that can be passed to the standard json.dump() method.

The complete code and the runners are in this Gist.

The stats show that json-stream has a flat memory curve for testing 100_000, 1_000_000, and 10_000_000 objects. It does take more time to read and transform, though:

Generate

| Method   | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|----------|-------|----------|----------|---------|----------|
| standard | 1e+05 | 0.19     | 0.17     | 0.01    | 45.84    |
| standard | 1e+06 | 2.00     | 1.93     | 0.06    | 372.97   |
| standard | 1e+07 | 21.67    | 20.46    | 1.03    | 3480.29  |
| stream   | 1e+05 | 0.18     | 0.15     | 0.00    | 7.28     |
| stream   | 1e+06 | 1.43     | 1.41     | 0.02    | 7.69     |
| stream   | 1e+07 | 14.41    | 14.07    | 0.20    | 7.58     |

Read

| Method   | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|----------|-------|----------|----------|---------|----------|
| standard | 1e+05 | 0.05     | 0.04     | 0.01    | 48.28    |
| standard | 1e+06 | 0.58     | 0.50     | 0.05    | 390.17   |
| standard | 1e+07 | 7.69     | 6.73     | 0.80    | 3875.81  |
| stream   | 1e+05 | 0.32     | 0.31     | 0.01    | 7.70     |
| stream   | 1e+06 | 2.96     | 2.94     | 0.02    | 7.69     |
| stream   | 1e+07 | 29.88    | 29.65    | 0.17    | 7.77     |

Transform

| Method   | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|----------|-------|----------|----------|---------|----------|
| standard | 1e+05 | 0.19     | 0.17     | 0.01    | 48.05    |
| standard | 1e+06 | 1.83     | 1.75     | 0.07    | 388.83   |
| standard | 1e+07 | 20.16    | 19.15    | 0.91    | 3875.49  |
| stream   | 1e+05 | 0.63     | 0.61     | 0.01    | 7.61     |
| stream   | 1e+06 | 6.06     | 6.02     | 0.03    | 7.92     |
| stream   | 1e+07 | 61.44    | 60.89    | 0.35    | 8.44     |
Zach Young
  • Inspired by the above statistics, I have made some performance optimisations in the json-stream library. They can be seen in [this PR](https://github.com/daggaz/json-stream/pull/34). By my own measurements they provide about a 28% improvement in run time. I'd appreciate any input you might have. – Jamie Cockburn Mar 02 '23 at 23:41

So the problem is that you have a long-running operation. Here are a few approaches I usually take:

  1. Optimize the operation: this rarely works. I can't recommend any superb library that will parse the JSON a few seconds faster.
  2. Change your logic: if the purpose is to save and load data, you may want to try something else, such as storing your object in a binary format instead (a minimal sketch follows below).
  3. Threads and callbacks, or deferred objects in some web frameworks: sometimes an operation takes longer than a request can wait, so we do it in the background (typical cases: zipping files and then emailing the archive to the user, sending SMS by calling a third party's API, and so on). A second sketch of this follows below.
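
For point 2, a minimal sketch using the standard-library shelve module as one possible binary store; the module choice and the store name are illustrative only, not something this answer prescribes:

import shelve

def write_to_cube(data):
    # Each key is pickled and stored separately, so an update does not
    # re-serialize the entire data set the way json.dump() does.
    with shelve.open('test_store') as db:   # creates e.g. test_store.db on disk
        db.update(data)

def read_key(key):
    with shelve.open('test_store') as db:
        return db[key]

And for point 3, a bare-bones sketch of pushing the slow read-update-write onto a background thread so the caller does not block; error handling and locking are deliberately omitted:

import json
import threading

def write_to_cube_async(data, path='test.json'):
    def worker():
        with open(path) as f:
            temp_data = json.load(f)
        temp_data.update(data)
        with open(path, 'w') as f:
            json.dump(temp_data, f)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t   # join() later if you need to know when the write finished
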
Quan To