
Is there an efficient way in Python to load only specific rows from a huge CSV file into memory for further processing, without reading the whole file?
E.g.: let's say I want to filter out the rows for a specific date from a file in the following format, and let's say this file is tens or hundreds of gigabytes (the dates are not ordered).

Date         event_type    country
2015/03/01   impression    US
2015/03/01   impression    US
2015/03/01   impression    CA
2015/03/01   click         CA
2015/03/02   impression    FR
2015/03/02   click         FR
2015/03/02   impression    US
2015/03/02   click         US
– Gluz

4 Answers


You still need to process every row in the file in order to check your condition. However, there is no need to load the whole file into memory, so you can stream it as follows:

import csv
with open('huge.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in spamreader:
        # Skip the header row and any row that is not from the wanted date
        if not row or row[0] != '2015/03/01':
            continue

        # Process the matching row here

If you just need a list of the matched rows, it is faster and even simpler to use a list comprehension as follows:

import csv
with open('huge.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    # Keep only the rows whose first column matches the wanted date
    rows = [row for row in spamreader if row and row[0] == '2015/03/01']
– max
  • Using a `filter` instead of a for-loop with an `if` can be much faster for bigger inputs. – Sergei Lebedev Mar 21 '16 at 14:04
  • @SergeiLebedev, of course, especially if you need the list of matched rows rather than computing aggregated values. – max Mar 21 '16 at 14:09
  • Actually, I was referring to the Python 3 `filter`, which does not produce an intermediate list. The idea is to delegate as much iteration as possible to the C side. – Sergei Lebedev Mar 21 '16 at 14:34
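
For reference, a minimal sketch of the `filter` variant mentioned in the comments above, assuming the same file name and delimiter as in this answer:

import csv

with open('huge.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    # In Python 3, filter() returns a lazy iterator, so no intermediate list is built
    matched = filter(lambda row: row and row[0] == '2015/03/01', spamreader)
    for row in matched:
        print(row)  # or process each matched row here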

If the dates can appear anywhere, you will have to parse the whole file:

import csv

def get_rows(k, fle):
    with open(fle) as f:
        next(f)  # skip the header row
        for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
            if row[0] == k:
                yield row


for row in get_rows("2015/03/02", "in.txt"):
    print(row)

You could use multiprocessing to speed up the parsing by splitting the data into chunks; there are some ideas here.
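
A rough sketch of that idea, assuming the same file layout as above (the chunk size and worker count are illustrative):

import csv
from itertools import islice
from multiprocessing import Pool

def filter_chunk(args):
    # Parse one chunk of raw lines and keep only the rows matching the date
    key, lines = args
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    return [row for row in reader if row and row[0] == key]

def get_rows_parallel(key, fle, chunk_size=100000, workers=4):
    with open(fle) as f, Pool(workers) as pool:
        next(f)  # skip the header row
        # Read the file lazily, chunk_size lines at a time
        chunks = iter(lambda: list(islice(f, chunk_size)), [])
        for matched in pool.imap(filter_chunk, ((key, chunk) for chunk in chunks)):
            yield from matched

if __name__ == "__main__":
    for row in get_rows_parallel("2015/03/02", "in.txt"):
        print(row)

Whether this actually helps depends on whether the job is I/O-bound; on a single slow disk the sequential version above may well be faster.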

– Padraic Cunningham
import csv

# Countries to filter out; dict (or set) membership tests are O(1)
filter_countries = {'US': 1}
with open('data.tsv', 'r', newline='') as f_name:
    for line in csv.DictReader(f_name, delimiter='\t'):
        if line['country'] not in filter_countries:
            print(line)
– Evandro C.

I prefer a pandas-only approach to this that allows you to use all the features of pd.read_csv(). This approach envisions a situation where you may need to filter on different dates at different times, so it is worth a little overhead to create a date registry that can be saved to disk and re-used.

First, create a registry holding just the date data for your csv:

import pandas as pd

my_date_registry = pd.read_csv('data.csv', usecols=['Date'], engine='c')

(Note: in newer versions of pandas you can use engine='pyarrow', which will be faster.)
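
For example, a hypothetical snippet that builds the registry with that engine and saves it for re-use (the pyarrow dependency and the file names are assumptions):

import pandas as pd

# Build the date-only registry with the faster pyarrow parser (pandas >= 1.4, pyarrow installed)
my_date_registry = pd.read_csv('data.csv', usecols=['Date'], engine='pyarrow')

# Persist the registry so later runs can reuse it without rescanning the huge csv
my_date_registry.to_pickle('date_registry.pkl')
# ...and in a later session:
my_date_registry = pd.read_pickle('date_registry.pkl')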

There are two ways of using this registry and the skiprows parameter to filter out the rows you don't want. You may wish to experiment to see which one is faster for your specific data.

Option 1: Build a list of integer indexes

filter_date = '2017-03-09'

# Boolean mask of the rows matching the date; invert it to get the rows to skip
my_rows = my_date_registry['Date'] == filter_date
skip_rows = ~my_rows
my_skip_indexes = my_date_registry[skip_rows].index
my_skip_list = [x + 1 for x in my_skip_indexes]  # shift by 1 to account for the header row
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_skip_list)

N.B. Since your data has a header row, you must add 1 to every index in my_skip_indexes to make up for the header row.

Option 2: Create a Callable function

filter_date = '2017-03-09'
my_rows = my_date_registry[my_date_registry['Date'] == filter_date]
my_row_indexes = my_rows.index
# Keep row 0 (the header) plus every matching row, shifted by 1 for the header
my_row_indexes = set([0] + [x + 1 for x in my_row_indexes])
my_filter = lambda x: x not in my_row_indexes
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_filter)
– David R