The simplest answer is to get more RAM. Get enough to hold the parsed JSON and your two sets, and your algorithm will be fast again.
If buying more RAM isn't possible, you're going to need to craft an algorithm that isn't as memory hungry. As a first step, consider using a streaming JSON parser like ijson. This will allow you to keep in memory only the pieces of the file you care about. Assuming you have a lot of duplicates in eagle and sky, doing this step alone may reduce your memory usage enough to be quick again. Here's some code to illustrate; you'll have to run pip install ijson first:
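from ijson import items

eagle = set()
sky = set()
# Open in binary mode, which ijson prefers
with open("reverseURL.json", "rb") as file:
    # The "eagle.item" prefix yields the array elements one at a time
    for o in items(file, "eagle.item"):
        eagle.add(o)
    # Rewind and read the file again for the second array
    file.seek(0)
    for o in items(file, "sky.item"):
        sky.add(o)
eagleAndSky = eagle.intersection(sky)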
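If holding both sets at once is still too much, one possible refinement is to build only the eagle set and test each sky value against it as it streams past, so the second full set is never created. The sketch below is a minimal illustration of that idea, not a tuned implementation. The name streaming_intersection is a hypothetical helper, and the small lists are stand-ins for the iterators that items(file, "eagle.item") and items(file, "sky.item") would provide.

```python
from typing import Hashable, Iterable, Set


def streaming_intersection(first: Iterable[Hashable], second: Iterable[Hashable]) -> Set[Hashable]:
    """Return the values present in both iterables.

    Only the first iterable is fully held in memory (as a set); the
    second is consumed one value at a time.
    """
    seen = set(first)  # deduplicated copy of the first stream
    common = set()
    for value in second:
        if value in seen:
            common.add(value)
    return common


# Stand-in data (assumption); in practice these would be ijson item streams.
eagle_values = ["a.com", "b.com", "a.com", "c.com"]
sky_values = ["c.com", "d.com", "a.com", "c.com"]
result = streaming_intersection(eagle_values, sky_values)
print(sorted(result))
```

Passing the smaller of the two arrays as the first argument keeps the in-memory set as small as possible.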
If using ijson to parse the JSON as a stream doesn't get the memory usage down enough, you'll have to store your temporary state on disk. Python's sqlite3 module is a perfect fit for this type of work. You can create a temporary file database with a table for eagle and a table for sky, insert all the data into each table, add a unique index to remove duplicate data (and to speed up the query in the next step), then join the tables to get your intersection. Here's an example:
import os
import sqlite3
from tempfile import mkstemp

from ijson import items

# mkstemp creates the file safely; sqlite3 treats the empty file as a new database
fd, db_path = mkstemp(suffix=".sqlite3")
os.close(fd)
conn = sqlite3.connect(db_path)
c = conn.cursor()
# The unique constraint deduplicates rows and indexes the join column
c.execute("create table eagle (foo text unique)")
c.execute("create table sky (foo text unique)")
conn.commit()

with open("reverseURL.json", "rb") as file:
    for o in items(file, "eagle.item"):
        try:
            # Parameters must be a sequence, hence the one-element tuple
            c.execute("insert into eagle (foo) values(?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates
    file.seek(0)
    for o in items(file, "sky.item"):
        try:
            c.execute("insert into sky (foo) values(?)", (o,))
        except sqlite3.IntegrityError:
            pass  # this is expected on duplicates
conn.commit()

resp = c.execute("select sky.foo from eagle join sky on eagle.foo = sky.foo")
for foo, in resp:
    print(foo)
conn.close()
os.unlink(db_path)
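A possible variation on the same approach: SQLite's "insert or ignore" skips rows that would violate the unique constraint, which removes the need for the try/except, and executemany can consume an iterator directly. The sketch below uses an in-memory database and small stand-in lists so it runs without the JSON file. The name intersect_via_sqlite is a hypothetical helper; for real data you would pass a file path as db_path and the ijson item streams as the two iterables.

```python
import sqlite3
from typing import Iterable, List


def intersect_via_sqlite(eagle: Iterable[str], sky: Iterable[str], db_path: str = ":memory:") -> List[str]:
    """Load both streams into SQLite and return the values common to both."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("create table eagle (foo text unique)")
        conn.execute("create table sky (foo text unique)")
        # "insert or ignore" silently drops rows that violate the unique constraint
        conn.executemany("insert or ignore into eagle (foo) values (?)", ((v,) for v in eagle))
        conn.executemany("insert or ignore into sky (foo) values (?)", ((v,) for v in sky))
        conn.commit()
        rows = conn.execute(
            "select sky.foo from eagle join sky on eagle.foo = sky.foo order by sky.foo"
        )
        # Collected into a list for the demo; iterate over rows instead for large results
        return [foo for (foo,) in rows]
    finally:
        conn.close()


# Stand-in data (assumption), including duplicates
common = intersect_via_sqlite(["a.com", "b.com", "a.com", "c.com"], ["c.com", "d.com", "a.com"])
print(common)
```

Doing all inserts before a single commit, as here, also avoids paying for a transaction per row.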