
I have a very large JSON file (~30 GB) containing last month's Reddit comments, which is too large to load into RAM and analyze. I'm hoping there is a way to sample a random subset of the data for analysis before resorting to cloud compute.

Right now I'm reading in a fixed number of comments with the following code:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json
from IPython.display import display
%matplotlib inline

# Parse the first 2,000,000 comments (one JSON object per line)
r_data = []
with open('reddit_1_28_2018.txt') as f:
    counter = 0
    for line in f:
        r_data.append(json.loads(line))
        counter += 1
        if counter % 500000 == 0:
            print("Processed %d comments\n" % counter)
        if counter >= 2000000:
            break

print("Data loaded!")

A sample of the dataset can be found here: https://files.pushshift.io/reddit/comments/sample_data.json

h/t @deliriouslettuce
