import json

with open("reverseURL.json") as file:
    file2 = json.load(file)

eagle = file2["eagle"]

sky = file2["sky"]

eagleAndSky = set(eagle).intersection(sky)

print(eagleAndSky.pop())

print(eagleAndSky.pop())

I am trying to run this code on a JSON file that is 4.8 GB, but every time I run it, it freezes my computer and I don't know what to do. The JSON file's keys are the tags used in photos, and the value for each tag is a list of the image URLs that contain that tag. The program works when I run it on the JSON files created from the test and validation sets, since those are small, but when I run it on the file from the training set it freezes my computer because that file is huge, around 4.8 GB.
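For reference, the file maps each tag to a list of image URLs, so its structure is roughly this (made-up, shortened URLs just for illustration):

{
    "eagle": ["http://example.com/photos/001.jpg", "http://example.com/photos/017.jpg"],
    "sky": ["http://example.com/photos/017.jpg", "http://example.com/photos/238.jpg"]
}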

  • How much memory does your computer have? – John Zwinck Sep 04 '17 at 23:18
  • search for "streaming json parser" – Irmen de Jong Sep 04 '17 at 23:19
  • You want to parse a stream for JSON "sub-" or "child" objects. In other words, you don't keep the entire JSON object in memory, but only smaller pieces of it. You do your set operations on the smaller child object: https://stackoverflow.com/a/7795029/19410 – Alex Reynolds Sep 04 '17 at 23:21
  • My computer has about 6 GB of RAM – Leedle Sep 04 '17 at 23:25
  • Your computer probably isn't freezing; it is just taking a long time to parse 5GB of data, whose result is going to take far more than 5GB of memory. – chepner Sep 04 '17 at 23:31
  • Consider a very simple example: the 3-byte JSON file `"f"` will produce a Python `str` object that occupies 38 bytes of memory. Dictionaries *start* at nearly 300 bytes. – chepner Sep 04 '17 at 23:34

1 Answer


The simplest answer is to get more RAM. Get enough to hold the parsed JSON and your two sets, and your algorithm will be fast again.
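To get a rough sense of why the in-memory representation is so much bigger than the file on disk, you can check a few per-object sizes with sys.getsizeof (exact numbers vary by Python version and platform; the URL below is made up):

import sys

# CPython object overhead; exact figures depend on version and platform
print(sys.getsizeof("f"))    # a one-character string costs tens of bytes, not 1
print(sys.getsizeof([]))     # an empty list is already dozens of bytes
print(sys.getsizeof({}))     # an empty dict starts at a few dozen bytes and grows
print(sys.getsizeof("http://example.com/photos/001.jpg"))  # every URL string pays this overhead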

If buying more RAM isn't possible, you're going to need to craft an algorithm that isn't as memory hungry. As a first step, consider using a streaming JSON parser like ijson. That will let you keep in memory only the pieces of the file you care about. Assuming there are a lot of duplicates in eagle and sky, this step alone may reduce your memory usage enough to be quick again. Here's some code to illustrate; you'll have to run pip install ijson to run it:

from ijson import items

eagle = set()
sky = set()
with open("reverseURL.json") as file:
    # Stream only the value under the "eagle" key instead of parsing the whole file
    for o in items(file, "eagle"):
        eagle.update(o)
    # Rewind so the file can be streamed again for the "sky" key
    file.seek(0)
    for o in items(file, "sky"):
        sky.update(o)

eagleAndSky = eagle.intersection(sky)

If using ijson to parse the JSON as a stream doesn't get the memory usage down enough, you'll have to store your temporary state on disk. Python's sqlite3 module is a perfect fit for this type of work. You can create a temporary database file with a table for eagle and a table for sky, insert all the data into each table, add a unique constraint to remove duplicate data (and to speed up the query in the next step), then join the tables to get your intersection. Here's an example:

import os
import sqlite3
from tempfile import mktemp
from ijson import items

# mktemp only generates a unique path; sqlite3.connect() creates the actual file
db_path = mktemp(suffix=".sqlite3")
conn = sqlite3.connect(db_path)
c = conn.cursor()
c.execute("create table eagle (foo text unique)")
c.execute("create table sky (foo text unique)")
conn.commit()

with open("reverseURL.json") as file:
    for o in items(file, "eagle.item"):
        try:
            # Parameters must be a sequence, so wrap the URL in a one-element tuple
            c.execute("insert into eagle (foo) values (?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates
    file.seek(0)
    for o in items(file, "sky.item"):
        try:
            c.execute("insert into sky (foo) values (?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates

conn.commit()

# The join returns only the URLs that appear in both tables
resp = c.execute("select sky.foo from eagle join sky on eagle.foo = sky.foo")
for foo, in resp:
    print(foo)

conn.close()
os.unlink(db_path)
Matt Hardcastle