I previously posted a question about memory errors while working with large CSV files in a pandas DataFrame. To be clearer, I'm asking a new question: I get memory errors while merging big CSV files (more than 30 million rows). What is the solution for this? Thanks!
- You can read your csv file by streaming it in chunks; please refer to this [post](http://stackoverflow.com/questions/17444679/reading-a-huge-csv-in-python) (a sketch of this approach follows the comments). Or you can buy and add more RAM for your PC! If you need to do a lot of machine learning/deep learning work, that's probably the best solution. – Andreas Hsieh May 12 '16 at 17:26
- Get more memory... – Alexander May 12 '16 at 17:28
- The problem is not reading the files. Let's say I've read the files and I want to merge them based on one of the variables. I get an error message while merging the tables. – physics_2015 May 12 '16 at 17:30
- You may want to use an RDBMS (database) or Spark for that. Databases are designed for joining tables. Well, not only for that... ;) – MaxU - stand with Ukraine May 12 '16 at 17:33
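A minimal sketch of the chunked approach mentioned in the comments, adapted to the merge problem. It assumes one of the two files (here called `small.csv`) fits in memory while the 30M+ row file is streamed with pandas' `chunksize` option; file names and the join column `id` are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical file names and join column; adjust to your data.
SMALL_CSV = "small.csv"   # assumed to fit in memory
LARGE_CSV = "large.csv"   # the 30M+ row file, streamed in chunks
KEY = "id"                # the variable you merge on

small = pd.read_csv(SMALL_CSV)

# Stream the large file, merge each chunk against the in-memory table,
# and append results to disk so the full merged frame never lives in RAM.
first = True
for chunk in pd.read_csv(LARGE_CSV, chunksize=1000000):
    merged = chunk.merge(small, on=KEY, how="inner")
    merged.to_csv("merged.csv", mode="w" if first else "a",
                  header=first, index=False)
    first = False
```

If neither file fits in memory on its own, this pattern alone won't help, and the database/Spark suggestions below are the more realistic options.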
1 Answer
Using Python/Pandas to process datasets with tens of millions of rows isn't ideal. Rather than processing a massive CSV, consider warehousing your data into a database like Redshift, where you'll be able to query and manipulate your data orders of magnitude faster than you could with Pandas. Once your data is in a database, you can use SQL to aggregate/filter/reshape it into "bite size" exports and extracts for local analysis with Pandas if you'd like.
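As a minimal local sketch of the "do the join in SQL" idea, here is the same pattern with SQLite standing in for Redshift purely for illustration; the file names, table names, and columns are hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Load each CSV into its own table in chunks, so neither file needs to
# fit in memory (file names, table names and columns are hypothetical).
for csv_file, table in [("big_a.csv", "big_a"), ("big_b.csv", "big_b")]:
    for chunk in pd.read_csv(csv_file, chunksize=1000000):
        chunk.to_sql(table, conn, if_exists="append", index=False)

# Let the database do the join and the filtering, then pull back only a
# "bite size" result for local analysis with pandas.
query = """
    SELECT a.id, a.value, b.category
    FROM big_a AS a
    JOIN big_b AS b ON a.id = b.id
    WHERE b.category = 'something_of_interest'
"""
result = pd.read_sql_query(query, conn)
conn.close()
```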
Long term, consider using Spark, which is a distributed data analysis framework built on Scala. It definitely has a steeper learning curve than Pandas but borrows a lot of the core concepts.
Redshift: https://aws.amazon.com/redshift/
Spark: http://spark.apache.org/
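If you go the Spark route, a join over two large CSVs looks roughly like this in PySpark, assuming the Spark 2.x `SparkSession` API; the paths and the join column `id` are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-csv-merge").getOrCreate()

# Spark processes the files in a distributed/out-of-core fashion, so the
# full datasets never have to fit into local memory at once.
a = spark.read.csv("big_a.csv", header=True, inferSchema=True)
b = spark.read.csv("big_b.csv", header=True, inferSchema=True)

merged = a.join(b, on="id", how="inner")
merged.write.csv("merged_output", header=True)
```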

Anthony E