
I'm trying to merge very large CSV files using pandas and keep running out of memory. Is there a more memory-efficient way to do this?

I've tried using Dask DataFrames instead of plain pandas, but I still ran into the same problem.

temp_df = dd.merge(gcs_df, dias_fio_df, how='inner', on=['charttime', 'subject_id'])

I get a MemoryError: [screenshot of error]

Ana B

2 Answers


Did you try running the merges separately? Can you do the first merge without a memory error? You could split the cell into two cells and, before doing the second merge, delete the first two dataframes (gcs_e_df and gcs_m_df) from memory using del (see the sketch below).

Another thing that might be creating the issue is duplicate ['charttime', 'subject_id'] pairs in your tables: every duplicate key on one side is matched with every duplicate on the other (a Cartesian product per key), so each row gets joined multiple times and the merged dataframe can grow far larger than its inputs.
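
A minimal sketch of the first suggestion (running the merges in separate steps and freeing the inputs with del), assuming gcs_e_df, gcs_m_df and dias_fio_df are already loaded as in your notebook and that plain pandas is used here; adapt to dd.merge if you stay with Dask:

    import gc
    import pandas as pd

    # First merge in its own cell/step
    gcs_df = pd.merge(gcs_e_df, gcs_m_df, how='inner',
                      on=['charttime', 'subject_id'])

    # Drop the inputs you no longer need and force a garbage collection
    # so their memory is actually released before the next merge
    del gcs_e_df, gcs_m_df
    gc.collect()

    # Second merge runs only after the first two frames are gone
    temp_df = pd.merge(gcs_df, dias_fio_df, how='inner',
                       on=['charttime', 'subject_id'])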

astiegler
  • Can you explain more about how to use del (maybe include an example)? – Ana B Jun 24 '19 at 14:45
  • See the bottom answer of this question for an example [https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram]. You just have to use the del operator on the dataframes and call the garbage collector – astiegler Jun 24 '19 at 16:19
  • Thanks for the example! So I use del on the previous dataframes I created to clear up space so the merge now works. However when I try to write this newly created/merged dataframe to a csv file, I get a memory error. Any tips here? – Ana B Jun 24 '19 at 18:32
  • I think you can try to reduce the chunksize for the to_csv – astiegler Jun 25 '19 at 13:39
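
For reference, a minimal sketch of that last suggestion: to_csv accepts a chunksize argument that controls how many rows are written per batch (the filename and chunk size are just examples):

    # Write the merged result to disk in batches of rows instead of all at once
    temp_df.to_csv('merged_output.csv', index=False, chunksize=50_000)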

This is a common issue in pandas, and it's relevant not only to CSV files but to all large datasets.

When you're dealing with very large CSV files (or Parquet, etc.) and running out of memory, there are a few ways to mitigate it:

  1. As mentioned in the comments above, read the CSV files in chunks: you can use the chunksize parameter of pandas' read_csv() to read the files in smaller pieces, so you never load an entire file into memory at once (see the first sketch below).

  2. If you have it available, use a Dask DataFrame: Dask is a parallel computing library that can handle datasets that don't fit into memory. dask.dataframe mirrors the pandas API but reads and computes lazily, in partitions, so it can work on data that is too large for your memory (second sketch below).

  3. Use JupySQL along with DuckDB: this lets you read the data only when you need it, instead of loading everything into memory. Here is a recent tutorial about how to do it (the third sketch below shows the DuckDB side).

  4. Leverage a database: if your dataset is too large to fit into memory, you can always load it into a database and use SQL to join and manipulate the data (fourth sketch below). This might be overkill if you don't already have a DB up and running.

  5. Scale your machine via a cloud-based instance: you can use a cloud provider such as Amazon Web Services or Google Cloud Platform to store and manipulate your data. This is probably too much if you're just running ad-hoc analyses locally.
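
A minimal sketch of option 1, assuming one of the two tables is small enough to hold in memory while the other is streamed in chunks (file names and chunk size are placeholders):

    import pandas as pd

    # Keep the smaller table fully in memory
    dias_fio_df = pd.read_csv('dias_fio.csv')

    # Stream the larger table in chunks and merge each chunk separately
    merged_parts = []
    for chunk in pd.read_csv('gcs.csv', chunksize=100_000):
        merged_parts.append(chunk.merge(dias_fio_df, how='inner',
                                        on=['charttime', 'subject_id']))

    temp_df = pd.concat(merged_parts, ignore_index=True)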
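A sketch of option 2 with dask.dataframe: the reads and the merge are lazy, and writing straight to disk avoids materialising the full result in memory (file names are placeholders):

    import dask.dataframe as dd

    # Lazy reads: nothing is loaded until the result is computed or written
    gcs_df = dd.read_csv('gcs.csv')
    dias_fio_df = dd.read_csv('dias_fio.csv')

    temp_df = dd.merge(gcs_df, dias_fio_df, how='inner',
                       on=['charttime', 'subject_id'])

    # Writes one CSV per partition (the * is replaced by the partition number)
    temp_df.to_csv('merged-*.csv', index=False)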
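A sketch of the DuckDB side of option 3, using DuckDB's Python API directly rather than going through JupySQL; DuckDB can join the CSV files in place and stream the result to disk (file names are placeholders):

    import duckdb

    con = duckdb.connect()

    # Join the two CSV files with SQL and write the result straight to a new
    # file, so the joined data never has to fit into a pandas DataFrame
    con.execute("""
        COPY (
            SELECT *
            FROM 'gcs.csv'
            JOIN 'dias_fio.csv' USING (charttime, subject_id)
        ) TO 'merged.csv' (HEADER)
    """)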
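And a sketch of option 4 with SQLite, which ships with Python: the CSVs are loaded into the database in bounded-size chunks and the join happens inside the database (file and table names are placeholders):

    import sqlite3
    import pandas as pd

    con = sqlite3.connect('merge_work.db')

    # Load each CSV into its own table, chunk by chunk
    for table, path in [('gcs', 'gcs.csv'), ('dias_fio', 'dias_fio.csv')]:
        for chunk in pd.read_csv(path, chunksize=100_000):
            chunk.to_sql(table, con, if_exists='append', index=False)

    # Let the database do the join; only the joined result comes back to pandas
    temp_df = pd.read_sql_query(
        "SELECT * FROM gcs JOIN dias_fio USING (charttime, subject_id)",
        con,
    )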

Ido Michael