I have a very large JSON file (~30 GB) containing last month's Reddit comments, which is too large to load into RAM and analyze. I'm hoping there is a way to sample a random subset of the data for analysis before resorting to cloud compute.
I'm currently reading in a fixed number of comments using the following code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json
from IPython.display import display
%matplotlib inline
r_data = []
with open('reddit_1_28_2018.txt') as f:
    counter = 0
    for line in f:
        # Each line of the file is one JSON-encoded comment
        r_data.append(json.loads(line))
        counter += 1
        if counter % 500000 == 0:
            print("Processed %d comments\n" % counter)
        # Stop after the first 2 million comments to stay within RAM
        if counter >= 2000000:
            break
print("Data loaded!")
A sample of the dataset can be found here: https://files.pushshift.io/reddit/comments/sample_data.json
h/t @deliriouslettuce