I have two large JSON files representing one month's worth of Reddit comments (37 GB) and Reddit submissions (16 GB).
I want to load them into a Postgres installation on my local Windows 7 machine.
My naive approach was to use psycopg2.extras.DictCursor: I built a large list of dicts by loading each JSON object from the file, then ran cur.executemany(insert_sql, json_dicts).
This works, but I don't think it scales -- the json_dicts list becomes enormous before it's ever time to execute.
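For reference, roughly what I'm doing now (the connection string, table, and column names below are placeholders, not my real schema):

import json
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=reddit user=postgres")  # placeholder connection string
cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

# hypothetical insert for a comments table; the real columns differ
insert_sql = """
    INSERT INTO comments (id, author, body, created_utc)
    VALUES (%(id)s, %(author)s, %(body)s, %(created_utc)s)
"""

# build the entire month's worth of rows in memory...
json_dicts = []
with open("RC_2015-01", encoding="utf-8") as f:  # example filename
    for line in f:
        json_dicts.append(json.loads(line))

# ...then hand it all to psycopg2 at once
cur.executemany(insert_sql, json_dicts)
conn.commit()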
I've been thinking about how to write a scalable script for this.
First, I suspect that it's fine to read through the entire JSON file line by line -- that it won't blow up memory?
I am processing it as:
with open(fp[foo], encoding="utf-8") as f:
    for line in f:
        j_content = json.loads(line)
If that's true, then I can just do batched commits (say, commit every time I accumulate 100k objects). If I wrap it in try/except properly, I ought to be able to see where any failure happened and restart from that point the next time I try to process the file.
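A rough sketch of what I have in mind -- the checkpoint file is just my guess at how to track the restart point, and the table/columns are again placeholders:

import json
import os
import psycopg2

BATCH_SIZE = 100000
CHECKPOINT_FILE = "comments.checkpoint"  # hypothetical file recording the last committed line

# hypothetical insert; the real columns differ
insert_sql = """
    INSERT INTO comments (id, author, body, created_utc)
    VALUES (%(id)s, %(author)s, %(body)s, %(created_utc)s)
"""

def load_checkpoint():
    # resume from the line number recorded after the last successful commit
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    return 0

def save_checkpoint(line_no):
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(line_no))

conn = psycopg2.connect("dbname=reddit user=postgres")  # placeholder connection string
cur = conn.cursor()

start_at = load_checkpoint()
batch = []

with open("RC_2015-01", encoding="utf-8") as f:  # example filename
    for line_no, line in enumerate(f, start=1):
        if line_no <= start_at:
            continue  # skip lines already committed on a previous run
        batch.append(json.loads(line))
        if len(batch) >= BATCH_SIZE:
            try:
                cur.executemany(insert_sql, batch)
                conn.commit()
                save_checkpoint(line_no)  # everything up to this line is safely on disk
            except Exception:
                conn.rollback()
                raise  # checkpoint still points at the last good batch
            batch = []

# flush whatever is left over at the end of the file
if batch:
    cur.executemany(insert_sql, batch)
    conn.commit()
    save_checkpoint(line_no)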
I'm asking for advice on how best to digest this stuff - I have no experience wrangling files this large.
I suppose there might also be some optimizations I should do on the Postgres side to make this database manageable?