I previously posted a question about memory errors while working with large CSV files in a pandas DataFrame. To be clearer, I'm asking a new question: I get memory errors while merging big CSV files (more than 30 million rows). What is the solution for this? Thanks!
- You can read your csv file by streaming it in chunks; please refer to this [post](http://stackoverflow.com/questions/17444679/reading-a-huge-csv-in-python) (a sketch of this approach follows the comments). Or you can buy and add more RAM for your PC! If you need to do a lot of machine learning/deep learning work, that's probably the best solution. – Andreas Hsieh May 12 '16 at 17:26
- Get more memory... – Alexander May 12 '16 at 17:28
- The problem is not reading the files. Let's say I've read the files and I want to merge them based on one of the variables. I get an error message while merging the tables. – physics_2015 May 12 '16 at 17:30
- You may want to use an RDBMS (database) or Spark for that. Databases are designed for joining tables. Well, not only for that... ;) – MaxU - stand with Ukraine May 12 '16 at 17:33
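A minimal sketch of the chunked approach mentioned in the comments, adapted to the merge problem. It assumes one of the two files (here called `small.csv`) fits in memory while the 30M+ row file is streamed with pandas' `chunksize` option; file names and the join column `id` are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical file names and join column; adjust to your data.
SMALL_CSV = "small.csv"   # assumed to fit in memory
LARGE_CSV = "large.csv"   # the 30M+ row file, streamed in chunks
KEY = "id"                # the variable you merge on

small = pd.read_csv(SMALL_CSV)

# Stream the large file, merge each chunk against the in-memory table,
# and append results to disk so the full merged frame never lives in RAM.
first = True
for chunk in pd.read_csv(LARGE_CSV, chunksize=1000000):
    merged = chunk.merge(small, on=KEY, how="inner")
    merged.to_csv("merged.csv", mode="w" if first else "a",
                  header=first, index=False)
    first = False
```

If neither file fits in memory on its own, this pattern alone won't help, and the database/Spark suggestions below are the more realistic options.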
1 Answer
Using Python/Pandas to process datasets with tens of millions of rows isn't ideal. Rather than processing a massive CSV, consider warehousing your data into a database like Redshift, where you'll be able to query and manipulate your data orders of magnitude faster than you could with Pandas. Once your data is in a database, you can use SQL to aggregate/filter/reshape it into "bite size" exports and extracts for local analysis with Pandas if you'd like.
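As a minimal local sketch of the "do the join in SQL" idea, here is the same pattern with SQLite standing in for Redshift purely for illustration; the file names, table names, and columns are hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Load each CSV into its own table in chunks, so neither file needs to
# fit in memory (file names, table names and columns are hypothetical).
for csv_file, table in [("big_a.csv", "big_a"), ("big_b.csv", "big_b")]:
    for chunk in pd.read_csv(csv_file, chunksize=1000000):
        chunk.to_sql(table, conn, if_exists="append", index=False)

# Let the database do the join and the filtering, then pull back only a
# "bite size" result for local analysis with pandas.
query = """
    SELECT a.id, a.value, b.category
    FROM big_a AS a
    JOIN big_b AS b ON a.id = b.id
    WHERE b.category = 'something_of_interest'
"""
result = pd.read_sql_query(query, conn)
conn.close()
```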
Long term, consider using Spark, which is a distributed data analysis framework built on Scala. It definitely has a steeper learning curve than Pandas but borrows a lot of the core concepts.
Redshift: https://aws.amazon.com/redshift/
Spark: http://spark.apache.org/
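If you go the Spark route, a join over two large CSVs looks roughly like this in PySpark, assuming the Spark 2.x `SparkSession` API; the paths and the join column `id` are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-csv-merge").getOrCreate()

# Spark processes the files in a distributed/out-of-core fashion, so the
# full datasets never have to fit into local memory at once.
a = spark.read.csv("big_a.csv", header=True, inferSchema=True)
b = spark.read.csv("big_b.csv", header=True, inferSchema=True)

merged = a.join(b, on="id", how="inner")
merged.write.csv("merged_output", header=True)
```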

Anthony E