pandas: read a small random sample from big CSV, according to sampling policy

Question

Closely related to "Read a small random sample from a big CSV file into a Python data frame".

I have a very large CSV file with columns patient_id and visit_data. I want to read a small random sample from it, but whenever a patient is sampled I want all of that patient's records.

Stefan · Answer 1 · 2015-12-31T19:18:28.587

If you want to keep working with .csv, you can read the file in chunks, select the relevant rows from each chunk, and concatenate them, along the following lines (see docs):

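import pandas as pd

# target_id, filename and chunksize are placeholders to set for your data
matches = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # keep only the rows belonging to the patient of interest
    matches.append(chunk[chunk.patient_id == target_id])
patient = pd.concat(matches, ignore_index=True)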
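The loop above pulls the records of a single patient. To get a random sample of patients with all of their records, as the question asks, one possible approach is two passes over the file: first collect the distinct patient ids, then filter on a random subset of them. The following is only a sketch under assumptions: the function name `sample_patients` and its parameters `n_patients` and `seed` are made up for illustration, and it assumes the set of distinct ids fits in memory and that the ids are of one sortable type.

```python
import pandas as pd


def sample_patients(filename, n_patients, chunksize=100_000, seed=0):
    """Return all rows for a random sample of n_patients patient ids.

    Hypothetical helper; assumes a 'patient_id' column as in the question.
    """
    # Pass 1: collect the distinct patient ids, reading only that column.
    ids = set()
    for chunk in pd.read_csv(filename, usecols=["patient_id"], chunksize=chunksize):
        ids.update(chunk["patient_id"].unique())

    # Draw the ids to keep. Sorting first makes the draw reproducible for a
    # given seed. Raises ValueError if n_patients exceeds the number of ids.
    sampled = pd.Series(sorted(ids)).sample(n=n_patients, random_state=seed).tolist()

    # Pass 2: keep every row whose patient_id was sampled.
    parts = []
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        parts.append(chunk[chunk["patient_id"].isin(sampled)])
    return pd.concat(parts, ignore_index=True)
```

The file is read twice, but the first pass only parses one column, and memory use is bounded by the chunk size plus the rows that end up in the sample.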

However, I would recommend taking a look at HDF5 storage via pandas, as this allows you to select via queries on indexed data rather than iterating through the whole file. There are also various SQL-based options (see basic example).
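As an illustration of the SQL-based route, here is a minimal sketch using SQLite from the Python standard library together with pandas. SQLite is chosen here only because it needs no extra install; the function names, the table name `visits`, and the index name are all hypothetical, and the table name is interpolated into the SQL text, so it must be a trusted identifier.

```python
import sqlite3

import pandas as pd


def csv_to_sqlite(filename, db_path, table="visits", chunksize=100_000):
    """Load the CSV into an SQLite table chunk by chunk and index patient_id."""
    conn = sqlite3.connect(db_path)
    try:
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            chunk.to_sql(table, conn, if_exists="append", index=False)
        # The index lets the per-patient lookup avoid a full table scan.
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_patient ON {table} (patient_id)"
        )
        conn.commit()
    finally:
        conn.close()


def sample_patients_sql(db_path, n_patients, table="visits"):
    """Return all rows for n_patients randomly chosen patient ids."""
    query = f"""
        SELECT * FROM {table}
        WHERE patient_id IN (
            SELECT patient_id
            FROM (SELECT DISTINCT patient_id FROM {table})
            ORDER BY RANDOM()
            LIMIT ?
        )
    """
    conn = sqlite3.connect(db_path)
    try:
        return pd.read_sql_query(query, conn, params=(n_patients,))
    finally:
        conn.close()
```

The CSV is converted once with `csv_to_sqlite`; after that, each call to `sample_patients_sql` draws a fresh sample without re-reading the CSV.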
