
I have a very large JSON file (~30 GB) containing last month's Reddit comments, which is too large to load into RAM and analyze. I'm hoping there is a way to sample a random subset of the data for analysis before resorting to cloud compute.

Right now I'm reading in a fixed number of comments with the following code:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json
from IPython.display import display
%matplotlib inline

# Parse the first 2,000,000 comments (one JSON object per line)
r_data = []
with open('reddit_1_28_2018.txt') as f:
    counter = 0
    for line in f:
        r_data.append(json.loads(line))
        counter += 1
        if counter % 500000 == 0:
            print("Processed %d comments\n" % counter)
        if counter >= 2000000:
            break

print("Data loaded!")

A sample of the dataset can be found here: https://files.pushshift.io/reddit/comments/sample_data.json

h/t @deliriouslettuce
