Skiprow condition based on values and not on index in pandas

Question

I am trying to read a csv file in pandas. The data is something as follows:

Date    Value   Valid
23/05/2018  12  Yes
24/05/2018  13  No
25/05/2018  45  No
26/05/2018  11  Yes
27/05/2018  66  Yes
28/05/2018  50  No
29/05/2018  34  Yes
30/05/2018  27  No

I want to read the lines only where the input value for Valid column is a Yes. What would be the best way to do so?

I would like to filter it before and not after reading.

https://stackoverflow.com/questions/10717504/is-it-possible-to-use-read-csv-to-read-only-specific-lines — BENY, May 30 '18 at 14:20
Wen's linked post is an example of reading "1-line-at-a-time". This, of course, works, but I recommend you consider chunking, as the only real reason you might do this is to optimize memory consumption. — jpp, May 30 '18 at 14:24

score 7 · Accepted Answer · answered May 30 '18 at 14:15

You cannot ignore rows before you have read them.

Instead, you can filter after reading the data:

# read file to dataframe
df = pd.read_csv('file.csv')

# apply mask to dataframe
df = df[df['Valid'] == 'Yes']

If you run into memory issues since, for example, there are a very large number of 'No' rows, which are expensive to read into memory, you can chunk. A convenient way to chunk silently is dask.dataframe:

import dask.dataframe as dd

# create lazy reader object
df = dd.read_csv('file.csv')

# define filtering logic
df = df[df['Valid'] == 'Yes']

# apply filtering logic and convert to pandas dataframe
df = df.compute()

Note that in the above example no computational work is involved until the last line, where dask takes chunks, filters, then aggregates results to a regular pandas dataframe.

Skiprow condition based on values and not on index in pandas

1 Answers1