
I have two large JSON files representing one month's worth of Reddit Comments (37 GB) and Reddit Submissions (16 GB).

I want to pull them into a postgres installation on my local win7 machine.

My naive approach was to use psycopg2.extras.DictCursor: build a large list of dicts by loading each JSON object from the file, then perform a cur.executemany(insert_sql, json_dicts).

This works, but I don't think it scales: the json_dicts list will be enormous before it's time to execute.

I've been thinking about how to write a scalable script for this.

First, I suspect that it's fine to read through the entire JSON file line by line -- that it won't blow up memory?

I am processing it as:

    import json

    with open(fp[foo], encoding="utf-8") as f:
        for line in f:
            j_content = json.loads(line)  # only the current line is held in memory

If that's true, then I can just do batched commits (say, commit every time I accumulate 100k objects). If I wrap it in a try block nicely, I ought to be able to see the point where a failure happens and restart from there the next time I try to process.
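A minimal sketch of what I have in mind (the file name, table, columns, and connection parameters are all placeholders):

    import json
    import psycopg2

    BATCH_SIZE = 100000  # commit every 100k objects

    # Placeholder table/columns -- adjust to the actual schema.
    insert_sql = """
        INSERT INTO comments (id, author, body)
        VALUES (%(id)s, %(author)s, %(body)s)
    """

    conn = psycopg2.connect(dbname="reddit")
    cur = conn.cursor()

    batch, total = [], 0
    with open("comments.json", encoding="utf-8") as f:  # placeholder path
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= BATCH_SIZE:
                cur.executemany(insert_sql, batch)
                conn.commit()                      # durable checkpoint
                total += len(batch)
                batch = []
                print("committed", total, "rows")  # shows where to resume on failure

    if batch:                                      # flush the final partial batch
        cur.executemany(insert_sql, batch)
        conn.commit()

Each commit is a durable checkpoint, so the last printed row count tells me roughly where to resume if a later batch fails.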

I'm asking for advice on a better way to digest this stuff - I have no experience wrangling files this large.

I suppose, too, that there might be some optimizations I should make on the postgres side to keep this database manageable?
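From what I've read so far, two commonly cited bulk-load tweaks are disabling synchronous commits for the session and loading into an UNLOGGED table, then creating indexes only after the load finishes. A sketch of what I mean (database and table names are placeholders, and the UNLOGGED trade-off only makes sense if re-running the load after a crash is acceptable):

    import psycopg2

    conn = psycopg2.connect(dbname="reddit")  # placeholder connection
    cur = conn.cursor()

    # Don't wait for the WAL flush on every commit; a crash can lose the
    # most recent batches, but those can simply be re-loaded.
    cur.execute("SET synchronous_commit = off;")

    # An UNLOGGED table skips write-ahead logging entirely, which speeds up
    # loading considerably; its contents are truncated after a crash.
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS comments (
            id     text,
            author text,
            body   text
        );
    """)
    conn.commit()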

user3556757
  • As long as you avoid reading the whole file at once (e.g. `.read()`, `.readlines()`, or `list(f)`), you don't need to worry about memory, unless a single line is so huge that it can't be processed by itself – Yaroslav Surzhikov Oct 08 '17 at 04:18

1 Answer


There's a wonderful engine called Spark (which also has a Python API, pyspark) which can generally handle datasets that are too big to fit in memory and can load JSON objects. It's a bit of a pain to install on Win7, but it looks like there's a postgres package (see "Using pyspark to connect to PostgreSQL") that you can download and use in conjunction with Spark to do what you're asking. It's currently very popular in many industries because of how easily it scales.
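For example, a minimal sketch using Spark's JDBC writer (the driver jar, connection URL, table, and credentials below are placeholders you'd substitute):

    from pyspark.sql import SparkSession

    # Assumes a PostgreSQL JDBC driver jar has been downloaded locally.
    spark = (SparkSession.builder
             .appName("reddit-load")
             .config("spark.jars", "postgresql-42.2.5.jar")  # placeholder jar
             .getOrCreate())

    # Spark reads newline-delimited JSON lazily, so the file
    # never has to fit in memory.
    df = spark.read.json("comments.json")  # placeholder path

    (df.select("id", "author", "body")     # keep only the columns you need
       .write
       .format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/reddit")
       .option("dbtable", "comments")
       .option("user", "postgres")
       .option("password", "postgres")
       .mode("append")
       .save())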

ayplam