I found the json-stream package, which might be able to help. It does provide the mechanics for stepping over an input JSON document and for streaming Python data structures out to a JSON file, but without concrete details from OP it's hard to say whether it would have met their needs.
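For context, the two mechanics I mean are `json_stream.load()`, which walks the input lazily, and the `@streamable_dict` wrapper, which lets a generator be serialized by the standard `json.dump()`. A minimal sketch of both (the file names are just placeholders):

```python
import json
import json_stream
from json_stream import streamable_dict

# reading: json_stream.load() returns a transient, dict-like view of the
# input; values are pulled from the file only as they are visited
with open("in.json") as f:
    for key, value in json_stream.load(f).items():
        ...  # each value is only available while it is the current item

# writing: a generator wrapped with @streamable_dict can be handed straight
# to the standard json.dump(), which consumes it lazily
@streamable_dict
def rows():
    yield "0", {"foo": "bar"}

with open("out.json", "w") as f:
    json.dump(rows(), f)
```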
Just to see if it actually has any memory advantage in processing large files, I've mocked up this basic JSON:
```
{
 "0": {"foo": "bar"},
 "1": {"foo": "bar"},
 "2": {"foo": "bar"},
 "3": {"foo": "bar"},
 ...up to 10M objects...
 "9999997": {"foo": "bar"},
 "9999998": {"foo": "bar"},
 "9999999": {"foo": "bar"}
}
```
and I've made up the requirement to change every odd-numbered object to `{"foo": "BAR"}`:
```
{
 "0": {"foo": "bar"},
 "1": {"foo": "BAR"},
 "2": {"foo": "bar"},
 "3": {"foo": "BAR"},
 ...
 "9999997": {"foo": "BAR"},
 "9999998": {"foo": "bar"},
 "9999999": {"foo": "BAR"}
}
```
I'm certain this is more trivial than what OP actually needed to do with their update dict (which I imagine had a moderately "deep" structure).
I've written scripts to handle generating, reading, and transforming some test files:
generate:
```python
import json
from json_stream import streamable_dict

@streamable_dict
def yield_obj(n: int):
    for x in range(n):
        yield str(x), {"foo": "bar"}

def gen_standard(n: int):
    # materialize the whole dict in memory, then dump it
    with open(f"gen/{n}.json", "w") as f:
        obj = dict(list(yield_obj(n)))
        json.dump(obj, f, indent=1)

def gen_stream(n: int):
    # hand the generator straight to json.dump()
    with open(f"gen/{n}.json", "w") as f:
        json.dump(yield_obj(n), f, indent=1)
```
`yield_obj()` is a generator that can either be materialized with `dict(list(...))` or streamed to the standard `json.dump()` method, thanks to the `@streamable_dict` wrapper.
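To make the two call patterns concrete, here's the same generator exercised both ways with a tiny `n`, writing to stdout (just an illustration, not part of the benchmark):

```python
import sys

# materialized: the whole dict exists in memory before it is dumped
json.dump(dict(list(yield_obj(3))), sys.stdout, indent=1)

# streamed: json.dump() pulls pairs from the generator one at a time,
# so only the current pair is held in memory
json.dump(yield_obj(3), sys.stdout, indent=1)
```

Both calls produce the same JSON text; only the peak memory differs.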
Running these at three sizes makes three test files:
```
-rw-r--r-- 1 zyoung staff 2.9M Feb 23 17:24 100000.json
-rw-r--r-- 1 zyoung staff  30M Feb 23 17:24 1000000.json
-rw-r--r-- 1 zyoung staff 314M Feb 23 17:24 10000000.json
```
read, which just loads and passes over everything:
```python
import json
import json_stream

def read_standard(fname: str):
    with open(fname) as f:
        for _ in json.load(f):
            pass

def read_stream(fname: str):
    with open(fname) as f:
        for _ in json_stream.load(f):
            pass
```
transform, which applies my silly "uppercase every odd object's bar" rule:
```python
import json
import json_stream
from json_stream import streamable_dict

def transform_standard(fname: str):
    with open(fname) as f_in:
        data = json.load(f_in)
    for key, value in data.items():
        if int(key) % 2 == 1:
            value["foo"] = "BAR"
    # out_name() just derives the output path; it's defined in the Gist
    with open(out_name(fname), "w") as f_out:
        json.dump(data, f_out, indent=1)

def transform_stream(fname: str):
    @streamable_dict
    def update(data):
        for key, value in data.items():
            # copy the transient streamed value into a plain dict so it
            # can be modified and re-serialized
            value = json_stream.to_standard_types(value)
            if int(key) % 2 == 1:
                value["foo"] = "BAR"
            yield key, value

    # the output is written while the input is still open, because update()
    # pulls items from the input stream as json.dump() consumes them
    with open(fname) as f_in:
        data = json_stream.load(f_in)
        updated_data = update(data)
        with open(out_name(fname), "w") as f_out:
            json.dump(updated_data, f_out, indent=1)
```
`@streamable_dict` is used again to turn the `update()` generator into a streamable object that can be passed to the standard `json.dump()` method, so only one key-value pair is converted and written at a time.
The complete code and the runners are in this Gist.
The stats show that `json-stream` has a flat memory curve across 100_000, 1_000_000, and 10_000_000 objects. It does take more time to read and transform, though (a sketch of how numbers like these can be collected follows the tables):
Generate
| Method | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|:---|---:|---:|---:|---:|---:|
| standard | 1e+05 | 0.19 | 0.17 | 0.01 | 45.84 |
| standard | 1e+06 | 2.00 | 1.93 | 0.06 | 372.97 |
| standard | 1e+07 | 21.67 | 20.46 | 1.03 | 3480.29 |
| stream | 1e+05 | 0.18 | 0.15 | 0.00 | 7.28 |
| stream | 1e+06 | 1.43 | 1.41 | 0.02 | 7.69 |
| stream | 1e+07 | 14.41 | 14.07 | 0.20 | 7.58 |
Read
| Method | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|:---|---:|---:|---:|---:|---:|
| standard | 1e+05 | 0.05 | 0.04 | 0.01 | 48.28 |
| standard | 1e+06 | 0.58 | 0.50 | 0.05 | 390.17 |
| standard | 1e+07 | 7.69 | 6.73 | 0.80 | 3875.81 |
| stream | 1e+05 | 0.32 | 0.31 | 0.01 | 7.70 |
| stream | 1e+06 | 2.96 | 2.94 | 0.02 | 7.69 |
| stream | 1e+07 | 29.88 | 29.65 | 0.17 | 7.77 |
Transform
| Method | Items | Real (s) | User (s) | Sys (s) | Mem (MB) |
|:---|---:|---:|---:|---:|---:|
| standard | 1e+05 | 0.19 | 0.17 | 0.01 | 48.05 |
| standard | 1e+06 | 1.83 | 1.75 | 0.07 | 388.83 |
| standard | 1e+07 | 20.16 | 19.15 | 0.91 | 3875.49 |
| stream | 1e+05 | 0.63 | 0.61 | 0.01 | 7.61 |
| stream | 1e+06 | 6.06 | 6.02 | 0.03 | 7.92 |
| stream | 1e+07 | 61.44 | 60.89 | 0.35 | 8.44 |
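The Gist has the runners that actually produced these real/user/sys and peak-memory figures. If you want to collect comparable numbers yourself, a rough, Unix-only sketch is below; the `measure()` helper is a name I made up here, not something from the Gist or from json-stream:

```python
import os
import resource
import sys
import time

def measure(func, *args):
    # wall-clock and CPU time around the call
    wall0, cpu0 = time.perf_counter(), os.times()
    func(*args)
    wall1, cpu1 = time.perf_counter(), os.times()
    # peak RSS of the whole process: bytes on macOS, kilobytes on Linux
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mb = peak / 2**20 if sys.platform == "darwin" else peak / 2**10
    print(f"real {wall1 - wall0:.2f}s  "
          f"user {cpu1.user - cpu0.user:.2f}s  "
          f"sys {cpu1.system - cpu0.system:.2f}s  "
          f"mem {peak_mb:.2f}MB")

# e.g. measure(transform_stream, "gen/10000000.json")
# run each case in a fresh process, otherwise ru_maxrss reports the largest
# case seen so far rather than the one being measured
```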