
I'm having memory problems while using Pandas on some big CSV files (more than 30 million rows). What is the best solution for this? I need to merge a couple of big tables. Thanks a lot!

  • What is the size of the CSV file, and how much RAM do you have? Did you try options like `low_memory=False` and `chunksize` while reading the data? – Kathirmani Sukumar May 12 '16 at 05:33
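
As a minimal sketch of the chunked reading suggested in the comment above (the file name, chunk size, and the per-chunk row tally are placeholder assumptions, not part of the original question):

```python
import pandas as pd

# Read the file in pieces rather than all at once; each iteration yields a DataFrame.
total_rows = 0
for chunk in pd.read_csv("big_table.csv", chunksize=1_000_000, low_memory=False):
    total_rows += len(chunk)  # stand-in for whatever per-chunk work is actually needed

print(total_rows)
```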

1 Answer


Possible duplicate of Fastest way to parse large CSV files in Pandas.

The takeaway is that if you load the CSV data often, a better approach is to parse it once (with a conventional `read_csv`) and store it in HDF5 format. Pandas (via the PyTables library) provides an efficient way to handle this [docs].
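A minimal sketch of that parse-once workflow, assuming placeholder file names and store key (`to_hdf`/`read_hdf` need the PyTables `tables` package installed):

```python
import pandas as pd

# One-time, slow text parse, then store a binary copy for fast reuse.
df = pd.read_csv("big_table.csv")
df.to_hdf("big_table.h5", key="table", mode="w")

# Later sessions read the HDF5 store instead of re-parsing the CSV.
df = pd.read_hdf("big_table.h5", key="table")
```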

Also, the answer to What is the fastest way to upload a big csv file in notebook to work with python pandas? shows timed runs (`timeit`) on a sample dataset, comparing CSV vs. csv.gz vs. Pickle vs. HDF5.
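A rough sketch of that kind of comparison, with placeholder paths and a simple wall-clock timer standing in for `timeit`:

```python
import time
import pandas as pd

df = pd.read_csv("big_table.csv")

# Save the same data in each format once.
df.to_csv("big_table.csv.gz", index=False, compression="gzip")
df.to_pickle("big_table.pkl")
df.to_hdf("big_table.h5", key="table", mode="w")

# Time a full load from each format.
loaders = [
    ("csv", lambda: pd.read_csv("big_table.csv")),
    ("csv.gz", lambda: pd.read_csv("big_table.csv.gz", compression="gzip")),
    ("pickle", lambda: pd.read_pickle("big_table.pkl")),
    ("hdf5", lambda: pd.read_hdf("big_table.h5", key="table")),
]
for name, load in loaders:
    start = time.perf_counter()
    load()
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```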

Sameer Mirji
  • The problem is not in uploading the file. The problem is merging a couple of big tables. – physics_2015 May 12 '16 at 06:41
  • Your question is slightly misleading in that case. That said, the `HDF5` format still works best for your requirement. See [this](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) for more clarity. – Sameer Mirji May 12 '16 at 07:00
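
Since the actual goal is merging, here is a hedged sketch of one chunked-merge approach; the file names, the join column `key`, and the assumption that the second table fits in RAM are all hypothetical:

```python
import pandas as pd

small = pd.read_csv("small_table.csv")  # assumed to fit comfortably in memory

# Merge the large table against the small one chunk by chunk, then combine.
pieces = []
for chunk in pd.read_csv("big_table.csv", chunksize=1_000_000):
    pieces.append(chunk.merge(small, on="key", how="inner"))

merged = pd.concat(pieces, ignore_index=True)
merged.to_hdf("merged.h5", key="merged", mode="w")  # store the result for fast reuse
```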