train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: i>0 and random.random() > 0.3)
I had this, but realized it won't be reproducible. Is there a way to randomly select a subset of rows from a large CSV, in a reproducible manner, without knowing the length of the file? It seems like something read_csv would support.
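One way to keep the skiprows callable reproducible is to decide each row's fate from a deterministic hash of its index rather than from global random state. This is only a sketch (the seed and the ~30% keep fraction are placeholder values, and it gives an approximate rather than exact sample size):

import zlib
import pandas as pd

SEED = 123           # placeholder seed
KEEP_FRACTION = 0.3  # roughly 30% of rows, matching the lambda above

def keep_row(i: int) -> bool:
    # Row 0 is the header; always keep it.
    if i == 0:
        return True
    # crc32 of "seed-index" is deterministic across runs and platforms,
    # unlike random.random().
    h = zlib.crc32(f"{SEED}-{i}".encode())
    return (h % 10_000) / 10_000 < KEEP_FRACTION

# skiprows expects True for rows to *skip*, so invert keep_row.
train_df = pd.read_csv(train_file, header=0, skiprows=lambda i: not keep_row(i))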
I know there is a function
df.sample(random_state=123)
However, I'd need this functionality when reading in the CSV because of the size of the file.
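A possible workaround, if sampling per chunk is acceptable, is to stream the file with chunksize and call sample on each chunk with a fixed random_state, which stays reproducible without ever loading the whole file. A sketch with placeholder chunk size and fraction:

import pandas as pd

# Placeholder values: adjust chunk size / fraction as needed.
chunks = pd.read_csv(train_file, header=0, chunksize=100_000)
sampled = [chunk.sample(frac=0.3, random_state=123) for chunk in chunks]
train_df = pd.concat(sampled, ignore_index=True)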
I know for certain that the number of rows is more than 900k, so I can do...
np.random.seed(42)
skip = np.random.randint(0,900000,200000)
train_df = pd.read_csv(train_file, header=0, skiprows=skip)
But this doesn't give every row an equal chance of making it into the sample (rows beyond the first 900k are never skipped, and randint draws with replacement, so fewer than 200k distinct rows actually get dropped), so it's not ideal. Can read_csv scan a CSV and return the length of the file?
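As far as I know, read_csv won't report the row count without parsing the file, but a plain line count is cheap, and once the length is known, np.random.choice with replace=False gives an exact, uniform, reproducible set of rows to skip. A sketch, assuming no quoted fields contain embedded newlines:

import numpy as np
import pandas as pd

# Count data rows cheaply (minus 1 for the header).
with open(train_file) as f:
    n_rows = sum(1 for _ in f) - 1

# Draw 200k distinct row numbers to skip, uniformly over the whole file.
rng = np.random.default_rng(42)
skip = rng.choice(n_rows, size=200_000, replace=False) + 1  # +1 shifts past the header line

train_df = pd.read_csv(train_file, header=0, skiprows=skip)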