train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i>0 and random.random() > 0.3)

I had this, but realized it won't be reproducible. Is there a way to randomly select a subset of rows from a large CSV, in a reproducible manner, without knowing the length of the file? It seems like something read_csv would support.

I know there is a function

df.sample(random_state=123) 

However, I'd need this functionality when reading in the CSV because of the size of the file.
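
One workaround I can think of is reading the file in chunks and sampling a fixed fraction of each chunk with a fixed random_state, something like the rough sketch below (the chunksize is arbitrary), but I'd prefer something built into read_csv:

import pandas as pd

# rough sketch: sample ~30% of each chunk reproducibly; chunksize is arbitrary
chunks = pd.read_csv(train_file, header=0, chunksize=100000)
train_df = pd.concat(chunk.sample(frac=0.3, random_state=123) for chunk in chunks)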

I know for certain that the number of rows is more than 900k, so I can do...

np.random.seed(42)
skip = np.random.randint(0,900000,200000)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)

But this doesn't give every row an equal chance of making it into the sample, so it's not ideal. Can read_csv scan a CSV and return the length of the file?
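
(For context, the only single-pass approach I know of that gives every row an equal chance without knowing the file length is reservoir sampling, roughly along the lines of the sketch below, where the sample size and seed are just placeholders. I was hoping read_csv had something like this built in.)

import io
import random
import pandas as pd

def reservoir_sample_csv(path, k, seed=42):
    # single pass: every data row ends up in the sample with equal probability
    rng = random.Random(seed)
    with open(path) as f:
        header = next(f)                 # keep the header line
        reservoir = []
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randint(0, i)    # inclusive; keeps row with probability k/(i+1)
                if j < k:
                    reservoir[j] = line
    return pd.read_csv(io.StringIO(header + ''.join(reservoir)))

# train_df = reservoir_sample_csv(train_file, 200000)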

Yale Newman

3 Answers


Here it is necessary to read the file twice: first to get the length, and then with read_csv, because read_csv cannot return the length of the file:

import pandas as pd
import numpy as np

np.random.seed(1245)

def file_len(fname):
    # count the number of lines in the file (including the header)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

train_file = 'file.csv'
num = file_len(train_file)
print (num)

# start at 1 so the header row (index 0) is never skipped
skip = np.random.randint(1, num, 200000)
#more dynamic - 20% of length
#skip = np.random.randint(1, num, int(num * 0.2))
train_df = pd.read_csv(train_file, header=0, skiprows=skip)
print (train_df)
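
Note that np.random.randint can return duplicate indices, so slightly fewer than 200000 rows may actually be skipped. If you need exactly 200000 distinct rows removed, one possible variation is np.random.choice with replace=False:

# draw 200000 distinct indices from 1..num-1, so the header (row 0) is never skipped
skip = np.random.choice(np.arange(1, num), size=200000, replace=False)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)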
jezrael

You could try

import pandas as pd
import numpy as np

np.random.seed(4)
# np.random.choice(5) returns a value in 0-4; any non-zero value skips the row,
# so roughly 20% of the data rows are kept (the header, row 0, is always kept)
pd.read_csv(file, header=0,
            skiprows=lambda i: i>0 and np.random.choice(5))
Sai Kumar
np.random.seed(42)
p = 0.3  # fraction of rows to read in
train_df = pd.read_csv(train_file, header=0, skiprows=lambda x: x > 0 and np.random.random() > p)
Yale Newman