
I had posted a question regarding memory errors while working with large CSV files using a pandas DataFrame. To be clearer, I'm asking another question: I get memory errors while merging big CSV files (more than 30 million rows). So, what is the solution for this? Thanks!

  • You can read your CSV file by streaming it in chunks; please refer to this [post](http://stackoverflow.com/questions/17444679/reading-a-huge-csv-in-python) (a chunked-read sketch follows these comments). Or you can buy and add more RAM for your PC! If you need to do a lot of machine learning/deep learning work, then that's probably the best solution. – Andreas Hsieh May 12 '16 at 17:26
  • Get more memory... – Alexander May 12 '16 at 17:28
  • The problem is not reading the files. Let's say I've read the files and I want to merge them based on one of the variables. I get an error message while merging the tables. – physics_2015 May 12 '16 at 17:30
  • You may want to use an RDBMS (database) or Spark for that. Databases are designed for joining tables. Well, not only for that... ;) – MaxU - stand with Ukraine May 12 '16 at 17:33
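
A minimal sketch of the chunked/streaming approach mentioned in the first comment, assuming a hypothetical file `big_file.csv` with placeholder columns `key_column` and `value_column`:

```python
import pandas as pd

# Hypothetical file and column names -- replace with your own.
CSV_PATH = "big_file.csv"

# chunksize makes read_csv return an iterator of DataFrames, so only
# one chunk is held in memory at a time.
totals = None
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # Aggregate each chunk, then fold the partial results together.
    partial = chunk.groupby("key_column")["value_column"].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)
```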

1 Answer


Using Python/Pandas to process datasets with tens of millions of rows isn't ideal. Rather than processing a massive CSV, consider warehousing your data in a database like Redshift, where you'll be able to query and manipulate your data thousands of times faster than you could with Pandas. Once your data is in a database, you can use SQL to aggregate/filter/reshape it into "bite size" exports and extracts for local analysis with Pandas if you'd like.
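
A rough sketch of that workflow, using SQLite as a local stand-in for a warehouse like Redshift; the file, table, and column names here are hypothetical:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Load each large CSV into the database in chunks so pandas never has to
# hold a full file in memory.
for csv_name in ("left_table.csv", "right_table.csv"):
    table = csv_name.replace(".csv", "")
    for chunk in pd.read_csv(csv_name, chunksize=500_000):
        chunk.to_sql(table, conn, if_exists="append", index=False)

# Let the database do the join/aggregation and pull back only a small extract.
query = """
    SELECT l.key_column, COUNT(*) AS n_rows
    FROM left_table AS l
    JOIN right_table AS r ON l.key_column = r.key_column
    GROUP BY l.key_column
"""
extract = pd.read_sql(query, conn)
print(extract.head())
conn.close()
```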

Long term, consider using Spark, a distributed data analysis framework built on Scala. It definitely has a steeper learning curve than Pandas but borrows a lot of the core concepts (a minimal PySpark sketch follows the links below).

Redshift: https://aws.amazon.com/redshift/

Spark: http://spark.apache.org/
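
As a rough illustration of the Spark route (not the answerer's own code), here is a minimal PySpark sketch; the file paths and join key are hypothetical, and it assumes `pyspark` is installed or a Spark cluster is available:

```python
from pyspark.sql import SparkSession

# Hypothetical paths and join key; assumes a local or cluster Spark install.
spark = SparkSession.builder.appName("big-csv-merge").getOrCreate()

left = spark.read.csv("left_table.csv", header=True, inferSchema=True)
right = spark.read.csv("right_table.csv", header=True, inferSchema=True)

# Spark spills to disk and distributes the work, so the join is not bound
# by a single machine's RAM the way an in-memory pandas merge is.
merged = left.join(right, on="key_column", how="inner")

# Write the result out instead of collecting it into driver memory.
merged.write.csv("merged_output", header=True, mode="overwrite")

spark.stop()
```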

Anthony E