
I'm currently trying to read data from .csv files in Python 2.7, with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the rows that match the criterion are listed together in the csv file, so I leave the loop once I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8 GB of RAM, runs 64-bit Windows 7, and has a 3.40 GHz processor (I'm not certain what other information you need).

Charles Dillon
  • I'm aware that there are several similar-seeming questions, but none of them seemed specific enough to my problem to help much. Sorry if there is one that I missed. – Charles Dillon Jul 03 '13 at 09:48
  • You should store the read data in a database (e.g. SQLite) instead of keeping it in memory. You can then run further processing, like filtering, on the database. – Michael Butscher Jul 03 '13 at 09:54
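
A minimal sketch of the SQLite approach suggested in the comment above; the database and table names, the four-column layout and the WHERE clause are placeholders, and only the first four columns are imported here:

import csv
import sqlite3

conn = sqlite3.connect("data.db")
# untyped columns are fine in SQLite; extend this to however many columns you need
conn.execute("CREATE TABLE IF NOT EXISTS records (c0, c1, c2, c3)")

with open("file.csv", "rb") as csvfile:   # Python 2; on Python 3 use open("file.csv", newline="")
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)",
                     (row[:4] for row in reader))  # only the first 4 columns in this sketch
conn.commit()

# filtering now happens in the database, not in Python's memory
for row in conn.execute("SELECT * FROM records WHERE c3 = ?", ("some criterion",)):
    pass  # process row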

7 Answers


You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # the consecutive block of matching rows has ended; stop reading
                return

I also simplified your filter test; the logic is the same but more concise.

Because you only want a single consecutive run of rows that match the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # skip rows until the first match, then keep yielding rows while
        # they match, then stop reading altogether.
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    # process row
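
For example, a minimal sketch of one way to consume it without accumulating anything in memory (the output file name is a placeholder):

import csv

with open("matched.csv", "wb") as outfile:   # Python 2; on Python 3 use open("matched.csv", "w", newline="")
    writer = csv.writer(outfile)
    for row in getdata(somefilename, sequence_of_criteria):
        writer.writerow(row)   # each row is written out and then discarded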

You now hold only one row in memory at a time, instead of thousands of rows per criterion.

yield makes a function a generator function, which means it won't do any work until you start looping over it.
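
A tiny illustration of that laziness; the numbers() function below is purely a demonstration and not part of the code above:

def numbers():
    print("starting")    # nothing runs until iteration begins
    for i in range(3):
        yield i

gen = numbers()          # no output yet; no work has been done
for n in gen:            # "starting" prints here, then 0, 1, 2, one at a time
    print(n)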

Martijn Pieters

Although Martijn's answer is probably the best, here is a more intuitive way for beginners to process large csv files: it lets you work on groups of rows, or chunks, one at a time.

import pandas as pd

chunksize = 10 ** 8   # number of rows per chunk; lower this to fit your memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)    # process() is a placeholder for whatever per-chunk work you need
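
If you also need the question's criterion filter, here is a hedged sketch of how that could look per chunk; it assumes, as in the question, that the value of interest is in the fourth column, and that filename and criterion are defined as above:

import pandas as pd

chunksize = 10 ** 5                 # rows per chunk; tune to available memory
matches = []                        # the filtered pieces stay small

for chunk in pd.read_csv(filename, chunksize=chunksize):
    # keep only rows whose fourth column equals the criterion
    matches.append(chunk[chunk.iloc[:, 3] == criterion])

filtered = pd.concat(matches, ignore_index=True)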
mmann1123
  • Why does using pandas make it more intuitive? – wwii Sep 22 '17 at 04:23
  • 4 lines of code is always better for newbies like myself. – mmann1123 Sep 27 '17 at 19:42
  • The regular Python code is just as short, and lets you process per line. The generator function is only there to filter stuff; how would you go about doing the same filtering in Pandas? – Martijn Pieters Nov 08 '17 at 15:22
  • This is awesome! Solved my problem of loading and processing large csv files using pandas. Thanks! – Elsa Li Mar 21 '18 at 18:27
  • It works very well even when the content of some rows spans multiple lines! – Dielson Sales Dec 26 '19 at 22:00
  • Telling beginners to install `pandas` just to do simple line-by-line processing on a .csv file is a mistake. We all know that Python library management is a mess; there is no reason to do this when the built-in `csv` module is more than capable and comes pre-installed. – user5359531 Jul 30 '20 at 14:30
  • @user5359531 Data preparation and analysis is often done with pandas data frames, and it is very common to load input data in the form of csv files. See also the other answer mentioning the arguments 'chunksize' and 'usecols'. – Paul Rougieux Nov 02 '21 at 10:32
  • There is no process() function in any package. Where can I import it from? – DmitriBolt Apr 22 '23 at 00:28
  • Sorry, with `process()` I am just saying you can do some operation on the chunk. – mmann1123 Apr 23 '23 at 02:05

I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(), and genfromtxt() in turn to be 3 times faster than numpy.loadtxt(). It seems that you need pandas for large data sets.

I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.
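
A minimal sketch of that kind of timing comparison, assuming a purely numeric csv with one header row; "big.csv" is a placeholder name and the numbers will of course depend on your data:

import time

import numpy as np
import pandas as pd

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start, "seconds")

# the same file loaded three ways
timed("pandas.read_csv ", lambda: pd.read_csv("big.csv"))
timed("numpy.genfromtxt", lambda: np.genfromtxt("big.csv", delimiter=",", skip_header=1))
timed("numpy.loadtxt   ", lambda: np.loadtxt("big.csv", delimiter=",", skiprows=1))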

Steve
  • The OP's primary issue was not one of speed, it was one of memory exhaustion. Using a different function for processing the file itself doesn't remove the downsides of reading it into a list rather than using a stream processor. – pydsigner Feb 12 '19 at 19:52

For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me read a huge zipped file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']

# First set up the dataframe iterator: 'usecols' filters the columns and
# 'chunksize' sets the number of rows per chunk (change these parameters as you wish).
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip',
                      chunksize=20000, usecols=sample_cols_to_keep)

# This list will store the filtered dataframes for later concatenation.
df_lst = []

# Iterate over the file, filter on the criteria and append to the list.
for df_ in df_iter:
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
                 .pipe(lambda x: x[x.col_1 > 0]))  # e.g. keep rows where 'col_1' is greater than zero
    df_lst += [tmp_df.copy()]

# Finally, combine the filtered pieces into the larger output dataframe, say 'df_final'.
df_final = pd.concat(df_lst)
ewalel

What worked for me, and is super fast, is:

import dask.dataframe as dd
import time

t = time.clock()
# dask only builds a lazy task graph here; nothing is read yet
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
# .compute() actually reads the file and returns a pandas DataFrame
df_train = df_train.compute()
print("load train: ", time.clock() - t)

Another working solution is:

import pandas as pd 
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000 
traintypes = {
    'col1': 'category',
    'col2': 'str'}

cols = list(traintypes.keys())

df_list = [] # list to hold the batch dataframe

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(),fit()

    # Alternatively, append the chunk to list and merge all
    df_list.append(df_chunk) 

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
Yury Wallet
  • doesn't the `df_train=df_train.compute()` line in your first solution load the whole dataset into memory...which is what he's trying not to do? – Sam Dillard Feb 06 '20 at 18:05
  • `time.clock()` has been deprecated in Python 3.3 and will be removed from Python 3.8: use `time.perf_counter()` or `time.process_time()` instead. – Archon Oct 27 '21 at 08:11
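
Taking both comments into account, here is a hedged variant of the first snippet: it times with time.perf_counter() and applies a filter before .compute(), so only the matching rows are materialised as a pandas DataFrame; the file path, column names and criterion value are placeholders:

import time

import dask.dataframe as dd

t = time.perf_counter()

# build the lazy graph: read only two columns, then filter before materialising
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
df_train = df_train[df_train['col1'] == 'some criterion']   # placeholder filter

# only the filtered rows are brought into memory as a pandas DataFrame
df_train = df_train.compute()

print("load + filter train:", time.perf_counter() - t)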

Here's another solution for Python 3:

import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)   # handle the header row and each matching row
            count += 1
        elif count > 2:
            # we are past the consecutive block of matching rows; stop reading
            break

Here datareader is an iterator that yields one row at a time, so the whole file is never held in memory.

Rishabh Agrahari
  • *So, this works as efficiently as the solution which uses yield operator.*: sorry, it does not. The callback function call adds more overhead, especially since you then have to handle state explicitly and separately. – Martijn Pieters Jul 10 '18 at 09:24
  • @MartijnPieters Thanks. Updated the answer. – Rishabh Agrahari Jul 10 '18 at 09:31

If you are using pandas and have lots of RAM (enough to read the whole file into memory), try using pd.read_csv with low_memory=False, e.g.:

import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)
Mike T