I am currently reading in a large csv file (around 100 million lines), using code along the lines of the example described in https://docs.python.org/2/library/csv.html, e.g.:
    import csv
    with open('eggs.csv', 'rb') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in spamreader:
            process_row(row)
This is proving rather slow; I suspect it is because each line is read in individually (requiring lots of read calls to the hard drive). Is there any way of reading the whole csv file in at once, and then iterating over it? Although the file itself is large (around 5 GB), my machine has sufficient RAM to hold it in memory.
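For example, a rough sketch of the kind of thing I have in mind is below (assuming Python 2, since that is the version of the docs I linked; process_row is my own function). I'm not sure whether this is actually the right or idiomatic way to do it:

    import csv

    # Read the entire file into memory in one go.
    with open('eggs.csv', 'rb') as csvfile:
        data = csvfile.read()

    # splitlines() gives an in-memory list of lines; csv.reader accepts any
    # iterable of lines, so no further disk reads happen while parsing.
    spamreader = csv.reader(data.splitlines(), delimiter=' ', quotechar='|')
    for row in spamreader:
        process_row(row)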