5

I am trying to read a csv file in pandas. The data is something as follows:

Date    Value   Valid
23/05/2018  12  Yes
24/05/2018  13  No
25/05/2018  45  No
26/05/2018  11  Yes
27/05/2018  66  Yes
28/05/2018  50  No
29/05/2018  34  Yes
30/05/2018  27  No

I want to read the lines only where the input value for Valid column is a Yes. What would be the best way to do so?

I would like to filter it before and not after reading.

James Z
  • 12,209
  • 10
  • 24
  • 44
A.DS
  • 216
  • 1
  • 4
  • 14
  • 3
    you can't filter before reading the data – Mohamed Thasin ah May 30 '18 at 14:11
  • https://stackoverflow.com/questions/10717504/is-it-possible-to-use-read-csv-to-read-only-specific-lines – BENY May 30 '18 at 14:20
  • Wen's linked post is an example of reading "1-line-at-a-time". This, of course, works, but I recommend you consider chunking, as the only real reason you might do this is to optimize memory consumption. – jpp May 30 '18 at 14:24

1 Answers1

7

You cannot ignore rows before you have read them.

Instead, you can filter after reading the data:

# read file to dataframe
df = pd.read_csv('file.csv')

# apply mask to dataframe
df = df[df['Valid'] == 'Yes']

If you run into memory issues since, for example, there are a very large number of 'No' rows, which are expensive to read into memory, you can chunk. A convenient way to chunk silently is dask.dataframe:

import dask.dataframe as dd

# create lazy reader object
df = dd.read_csv('file.csv')

# define filtering logic
df = df[df['Valid'] == 'Yes']

# apply filtering logic and convert to pandas dataframe
df = df.compute()

Note that in the above example no computational work is involved until the last line, where dask takes chunks, filters, then aggregates results to a regular pandas dataframe.

jpp
  • 159,742
  • 34
  • 281
  • 339