
I know how to use pandas to read CSV files, but when I read a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns; it mostly contains genome data for large populations.

How can I overcome this problem? What is standard practice, and how do I select the appropriate tool? Can I process a file this big with pandas, or is there another tool?

– Nishant, janejoj
  • Do you need to read the whole file? You can pass the `chunksize` parameter to `read_csv` and process the chunks. – EdChum Nov 13 '15 at 09:56
  • Maybe [this question](http://stackoverflow.com/questions/33542977/pandas-groupby-with-sum-on-large-csv-file) helps. – jezrael Nov 13 '15 at 10:07
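A minimal sketch of the chunked approach suggested in the comments, assuming a hypothetical path `genome.csv` and hypothetical column names; only a small per-chunk summary is ever kept in memory:

import pandas as pd

# Read the CSV in chunks of 100000 rows instead of loading everything at once.
# usecols limits parsing to the columns you actually need, which matters when
# there are millions of them; "sample_1" and "sample_2" are invented names.
chunks = pd.read_csv("genome.csv",            # hypothetical path
                     chunksize=100000,
                     usecols=["sample_1", "sample_2"])

# Accumulate a small per-chunk result (here: column sums) so that only one
# chunk is held in memory at a time.
totals = None
for chunk in chunks:
    partial = chunk.sum()
    totals = partial if totals is None else totals + partial

print(totals)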

2 Answers


You can use Apache Spark to distribute in-memory processing of CSV files across a cluster; see the spark-csv package: https://github.com/databricks/spark-csv. Also take a look at ADAM's approach to distributed genomic data processing.
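A minimal PySpark sketch of that approach, assuming Spark 1.x launched with something like `spark-submit --packages com.databricks:spark-csv_2.10:1.3.0` and a hypothetical file path; the rows are read and parsed across the cluster rather than in a single process's memory:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="large-csv")
sqlContext = SQLContext(sc)

# Load the CSV lazily as a distributed DataFrame; nothing is pulled into
# driver memory at this point.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("genome.csv"))          # hypothetical path

# Work is only performed when an action runs (count, aggregate, write, ...).
print(df.count())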

– Igor Barinov

You can use the Python `csv` module to stream the file one row at a time:

import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # Process each row here; only one row is held in memory at a time,
        # instead of the whole file.
        pass
– itzMEonTV