train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i>0 and random.random() > 0.3)

I had this, but realized it won't be reproducible. Is there a way to randomly select a subset of rows from a large CSV, in a reproducible manner, without knowing the length of the file? It seems like something read_csv would support.

I know there is a function

df.sample(random_state=123) 

However, I'd need this functionality when reading in the CSV because of the size of the file.
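
One workaround I can think of is reading the file in chunks and sampling a fixed fraction of each chunk with a fixed random_state, something like the rough sketch below (the chunksize is arbitrary), but I'd prefer something built into read_csv:

import pandas as pd

# rough sketch: sample ~30% of each chunk reproducibly; chunksize is arbitrary
chunks = pd.read_csv(train_file, header=0, chunksize=100000)
train_df = pd.concat(chunk.sample(frac=0.3, random_state=123) for chunk in chunks)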

I know for certain that the number of rows is more than 900k, so I can do...

np.random.seed(42)
skip = np.random.randint(0,900000,200000)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)

But this doesn't give every row an equal chance of making it into the sample, so it's not ideal. Can read_csv scan a CSV and return the length of the file?
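
(For context, the only single-pass approach I know of that gives every row an equal chance without knowing the file length is reservoir sampling, roughly along the lines of the sketch below, where the sample size and seed are just placeholders. I was hoping read_csv had something like this built in.)

import io
import random
import pandas as pd

def reservoir_sample_csv(path, k, seed=42):
    # single pass: every data row ends up in the sample with equal probability
    rng = random.Random(seed)
    with open(path) as f:
        header = next(f)                 # keep the header line
        reservoir = []
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randint(0, i)    # inclusive; keeps row with probability k/(i+1)
                if j < k:
                    reservoir[j] = line
    return pd.read_csv(io.StringIO(header + ''.join(reservoir)))

# train_df = reservoir_sample_csv(train_file, 200000)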

Yale Newman

3 Answers


Here it is necessary to read the file twice: first to get the length, and then with read_csv, because read_csv cannot return the length of the file:

import pandas as pd
import numpy as np

np.random.seed(1245)

def file_len(fname):
    # count the number of lines in the file (including the header)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

train_file = 'file.csv'
num = file_len(train_file)
print (num)

# start at 1 so the header row (index 0) is never skipped
skip = np.random.randint(1, num, 200000)
#more dynamic - 20% of length
#skip = np.random.randint(1, num, int(num * 0.2))
train_df = pd.read_csv(train_file, header=0, skiprows=skip)
print (train_df)
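
Note that np.random.randint can return duplicate indices, so slightly fewer than 200000 rows may actually be skipped. If you need exactly 200000 distinct rows removed, one possible variation is np.random.choice with replace=False:

# draw 200000 distinct indices from 1..num-1, so the header (row 0) is never skipped
skip = np.random.choice(np.arange(1, num), size=200000, replace=False)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)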
jezrael

You could try

import pandas as pd
import numpy as np

np.random.seed(4)
# np.random.choice(5) returns a value in 0-4; any non-zero value skips the row,
# so roughly 20% of the data rows are kept (the header, row 0, is always kept)
pd.read_csv(file, header=0,
            skiprows=lambda i: i>0 and np.random.choice(5))
Sai Kumar
np.random.seed(42)
p = 0.3  # fraction of rows to read in
train_df = pd.read_csv(train_file, header=0, skiprows=lambda x: x > 0 and np.random.random() > p)
Yale Newman