
I know how to use pandas to read CSV files, but when I read a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns; it mostly contains genome data for large populations.

How can I overcome this problem? What is standard practice, and how do I select the appropriate tool? Can I process a file this big with pandas, or is there another tool?

– Nishant, janejoj
  • Do you need to read the whole file? You can pass the `chunksize` parameter to `read_csv` and process the chunks. – EdChum Nov 13 '15 at 09:56
  • Maybe [this question](http://stackoverflow.com/questions/33542977/pandas-groupby-with-sum-on-large-csv-file) helps. – jezrael Nov 13 '15 at 10:07
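A minimal sketch of the chunked approach suggested in the comments, assuming a hypothetical path `genome.csv` and hypothetical column names; only a small per-chunk summary is ever kept in memory:

import pandas as pd

# Read the CSV in chunks of 100000 rows instead of loading everything at once.
# usecols limits parsing to the columns you actually need, which matters when
# there are millions of them; "sample_1" and "sample_2" are invented names.
chunks = pd.read_csv("genome.csv",            # hypothetical path
                     chunksize=100000,
                     usecols=["sample_1", "sample_2"])

# Accumulate a small per-chunk result (here: column sums) so that only one
# chunk is held in memory at a time.
totals = None
for chunk in chunks:
    partial = chunk.sum()
    totals = partial if totals is None else totals + partial

print(totals)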

2 Answers


You can use Apache Spark to distribute in-memory processing of CSV files across a cluster; see the spark-csv package: https://github.com/databricks/spark-csv. Also take a look at ADAM's approach to distributed genomic data processing.
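A minimal PySpark sketch of that approach, assuming Spark 1.x launched with something like `spark-submit --packages com.databricks:spark-csv_2.10:1.3.0` and a hypothetical file path; the rows are read and parsed across the cluster rather than in a single process's memory:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="large-csv")
sqlContext = SQLContext(sc)

# Load the CSV lazily as a distributed DataFrame; nothing is pulled into
# driver memory at this point.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("genome.csv"))          # hypothetical path

# Work is only performed when an action runs (count, aggregate, write, ...).
print(df.count())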

– Igor Barinov

You can use the Python `csv` module to stream the file one row at a time:

import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # Process each row here; only one row is held in memory at a time,
        # instead of the whole file.
        pass
– itzMEonTV