130

How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?

Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.

JJJ
popstack

7 Answers

228

There isn't an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of your file, e.g.:

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary the chunksize to suit your available memory; see the pandas IO documentation on iterating through files chunk by chunk for more details.

Matti John
11

I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector, df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime.
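For illustration, a minimal end-to-end sketch (the file name data.csv and the cutoff value are made up; parse_dates ensures the timestamp column is loaded as datetimes so the comparison works):

import pandas as pd

# load the CSV, parsing the timestamp column as datetimes
df = pd.read_csv('data.csv', parse_dates=['timestamp'])

# keep only rows newer than the cutoff
targettime = pd.Timestamp('2021-01-01')
filtered = df[df['timestamp'] > targettime]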

Griffin
    I'm not sure about this, but I have the feeling this would be extremely heavy on memory usage. – Nathan Aug 16 '19 at 07:18
  • The question is about filtering the data on loading; I don't know why so many upvoted this answer. – precise Apr 27 '23 at 21:01
5

An alternative to the accepted answer is to apply read_csv() to a StringIO, obtained by filtering the input file.

import pandas as pd
from io import StringIO

with open(<file>) as f:
    header = next(f)  # keep the header row, which the filter would otherwise drop
    text = header + "".join(line for line in f if <condition>)

df = pd.read_csv(StringIO(text))

This solution is often faster than the accepted answer when the filtering condition retains only a small portion of the lines.

M. Page
  • How does the `<condition>` work here? `text = "\n".join([line for line in f if df['column']=='Hello'])` Like that? edit: no it doesn't work. `TypeError: string indices must be integers` – SCool Jan 07 '21 at 09:33
  • No. Suppose 'column' is the third column in your CSV file, assuming ',' is the field-separator, write: `with open() as f: text = "\n".join([line for line in f if line.split(',')[2] == "Hello"])` – M. Page May 30 '21 at 21:27
4

If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine the skiprows=range(1, start_row) and nrows parameters (see the sketch below). The import then takes seconds where the accepted solution would take minutes. A few experiments to find the right start_row are not a huge cost given the savings in import time. Notice that starting the range at 1 keeps the header row.
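A minimal sketch, assuming the rows of interest are already known to span data rows start_row through end_row (the file name and row numbers are made up):

import pandas as pd

start_row = 1000  # first data row to keep (row 0 of the file is the header)
end_row = 2000    # last data row to keep

df = pd.read_csv(
    'file.csv',
    skiprows=range(1, start_row),    # skip data rows before start_row, keep the header
    nrows=end_row - start_row + 1,   # number of data rows to read
)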

mirekphd
1

Say you have the below dataframe:

+----+--------+
| Id | Name   |
+----+--------+
|  1 | Sarath |
|  2 | Peter  |
|  3 | James  |
+----+--------+

If you need to filter the record where Id = 1, then you can use the below code.

df = pd.read_csv('Filename.csv', sep='|')
df = df[df["Id"] == 1]

This will produce the below output.

+----+--------+
| Id | Name   |
+----+--------+
|  1 | Sarath |
+----+--------+
Sarath Subramanian
-1

If you are on Linux, you can use grep.

# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def zgrep_data(f, string):
    '''grep multiple items; f is the filepath, string is what you are filtering for'''

    grep = 'grep'  # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    if string == '':
        # an empty pattern matches every line, so the header row is included
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # grep drops the header row (unless it happens to match), so read
        # only the first row of the original file to get the column names.
        # May need to change depending on how the data is stored.
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]

        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())

        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
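A hypothetical call (assuming a comma-separated data.csv with a header row, and that grep is on the PATH):

data = zgrep_data('data.csv', 'Hello')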
    Using grep is seriously bad choice for several reasons. 1) it's slow 2) it's not portable 3) it's not pandas or python (you can use regular expressions right inside python) which is why I downvoted your answer – Ahmed Masud Mar 07 '19 at 05:31
  • Your solution doesn't work on all platforms and also it includes Grep. This is the reason for the downvote. – Roman Orac Jul 04 '19 at 08:30
-5

You can specify the nrows parameter.

import pandas as pd

df = pd.read_csv('file.csv', nrows=100)

This code works well in pandas version 0.20.3.