I'm having memory problems while using Pandas on some big CSV files (more than 30 million rows), so I'm wondering what the best solution for this is. I need to merge a couple of big tables. Thanks a lot!
- What is the size of the csv file and what is the size of your RAM? Did you try options like `low_memory=False` and `chunksize` while reading the data? – Kathirmani Sukumar May 12 '16 at 05:33
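For reference, a minimal sketch of chunked reading (the file name and chunk size below are placeholders, not from the original question):

```python
import pandas as pd

# Stream the CSV in ~1M-row chunks instead of loading everything at once.
# "big.csv" and the chunk size are placeholders; low_memory=False can also
# help suppress mixed-dtype issues when reading in a single pass.
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    # process each chunk (a regular DataFrame) here
    print(chunk.shape)
```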
1 Answer
Possible duplicate of Fastest way to parse large CSV files in Pandas.
The inference is: if you are loading the csv file data often, then a better way would be to parse it once (with the conventional `read_csv`) and store it in HDF5 format. Pandas (with the PyTables library) provides an efficient way to handle this issue [docs].
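A minimal sketch of that one-time conversion, assuming placeholder file and key names:

```python
import pandas as pd

# Parse the CSV once, then keep it in HDF5 for much faster reloads.
# "big.csv", "big.h5" and the "data" key are placeholders.
df = pd.read_csv("big.csv")
df.to_hdf("big.h5", key="data", mode="w", format="table")

# In later sessions, reloading from HDF5 avoids re-parsing the CSV:
df = pd.read_hdf("big.h5", "data")
```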
Also, the answer to What is the fastest way to upload a big csv file in notebook to work with python pandas? shows you the timed execution (`timeit`) of a sample dataset, comparing csv vs csv.gz vs Pickle vs HDF5.


Sameer Mirji
- The problem is not in uploading the file. The problem is merging a couple of big tables. – physics_2015 May 12 '16 at 06:41
- Your question is slightly misleading in that case, although the `HDF5` format still works best for your requirement. Ref [this](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) for more clarity. – Sameer Mirji May 12 '16 at 07:00
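A minimal sketch of that kind of workflow, merging a large CSV against a smaller table chunk by chunk and writing the result to an HDF5 store (the file names and the `key` join column are placeholders):

```python
import pandas as pd

# Small table that fits comfortably in RAM ("lookup.csv" is a placeholder).
small = pd.read_csv("lookup.csv")

with pd.HDFStore("merged.h5", mode="w") as store:
    # Stream the big CSV and merge it chunk by chunk, so only one chunk
    # plus the small table is ever held in memory.
    for chunk in pd.read_csv("big_table.csv", chunksize=1_000_000):
        merged = chunk.merge(small, on="key", how="left")
        # Append each merged chunk to a single on-disk table.
        store.append("merged", merged, data_columns=["key"], index=False)
```

Note that appending chunks to one HDF5 table assumes consistent column dtypes across chunks; string columns may need a `min_itemsize` setting if their lengths vary.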