
I am running a long ETL pipeline in pandas. I have to create different pandas dataframes and I want to release memory for some of the dataframes.

I have been reading about how to release memory, and I saw that running this command doesn't release it:

del dataframe

Following this link: How to delete multiple pandas (python) dataframes from memory to save RAM?, one of the answers says that the del statement does not delete an instance; it merely deletes a name.
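To make the "deletes a name, not an instance" point concrete, here is a minimal sketch. It assumes CPython's reference counting and uses a hypothetical `Frame` class as a stand-in for a large dataframe, so it runs without pandas; the names `alias` and `probe` are only for illustration.

```python
import weakref


class Frame:
    """Hypothetical stand-in for a large DataFrame (not a pandas class)."""

    def __init__(self, n):
        self.data = list(range(n))


df = Frame(1000)
alias = df                 # a second name bound to the same object
probe = weakref.ref(df)    # lets us check liveness without keeping it alive

del df                     # removes only the name `df`
alive_after_first_del = probe() is not None   # object survives via `alias`

del alias                  # last reference gone; CPython frees it right away
alive_after_second_del = probe() is not None

print(alive_after_first_del, alive_after_second_del)
```

The object is only released once no name, list, or other container refers to it any more.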

That answer suggests putting the dataframes in a list and then deleting the list:

lst = [pd.DataFrame(), pd.DataFrame(), pd.DataFrame()]
del lst  

If I only want to release one dataframe, do I need to put it in a list and then delete the list, like this?

lst = [pd.DataFrame()]
del lst

I have also seen this question: How do I release memory used by a pandas dataframe?

There are different answers like:

import gc
del df_1
gc.collect()

Or, once you are done with the dataframe, just reassign the name:

df = ""

Or is there a better way to achieve that?

J.C Guzman

1 Answer


Following the original link that you included, you have to put the variable in the list, delete the variable, and then delete the list. If you only add the dataframe to the list, deleting the list won't delete the original dataframe, because the variable still refers to it.

import pandas as pd
import psutil 
import gc
psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 68.44267845153809

df = pd.read_csv('pythonSRC/bigFile.txt',sep='|')
len(df)
>> 20082056

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total

>> 56.380510330200195

lst = [df]
del lst

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 56.22601509094238

lst = [df]
del df
del lst

psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 76.77617073059082

gc.collect()

>> 0

I also tried just deleting the dataframe and calling gc.collect(), with the same result!

del df
gc.collect()
psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
>> 76.59363746643066
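The same effect can be checked without psutil. The sketch below is only an illustration under stated assumptions: it uses the standard-library `tracemalloc` module, which tracks allocations made through Python's allocator rather than the system-wide available memory reported by psutil above, and it uses a plain list (`big`, a hypothetical stand-in) instead of a dataframe.

```python
import gc
import tracemalloc

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()

# Stand-in for a big dataframe: a list holding one million references.
big = [0.0] * 1_000_000
held, _ = tracemalloc.get_traced_memory()

# Dropping the only reference is enough for CPython to free the list.
del big
after_del, _ = tracemalloc.get_traced_memory()

# gc.collect() only matters for objects caught in reference cycles;
# it returns the number of unreachable objects it found.
unreachable = gc.collect()
after_gc, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"held:      {held - baseline:>10} bytes")
print(f"after del: {after_del - baseline:>10} bytes")
print(f"after gc:  {after_gc - baseline:>10} bytes")
```

For an object that is not part of a reference cycle, the traced memory should already drop after `del`, before `gc.collect()` runs.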

However, adding the dataframe to a list and then deleting both the variable and the list is a bit faster than calling gc.collect(). I used time.time() to measure the difference, and gc.collect() was almost a full second slower!

EDIT:

As the comment below correctly points out, del df and del [df] do generate the same code. The problem with the original post, and with my original answer, is that as soon as you give the list a name, as in lst=[df], the list holds a second reference to the dataframe. Deleting lst removes only the list and that extra reference; the name df still keeps the dataframe alive.

lst=[df] 
del lst

is not the same as:

del [df]
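A quick way to check this claim yourself is to compare the compiled instructions with the standard-library `dis` module. This is a small sketch, not part of the original measurements; the helper name `ops` is hypothetical.

```python
import dis


def ops(src):
    """Return the (opname, argval) pairs for a snippet of source code."""
    code = compile(src, "<demo>", "exec")
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(code)]


# `del [df]` unpacks the target list at compile time, so it deletes the name df.
same = ops("del df") == ops("del [df]")

# Building a named list first is a different program: it creates a new list
# object and later deletes the name `lst`, never the name `df`.
differs = ops("lst = [df]\ndel lst") != ops("del df")

print(same, differs)
```

The snippets are only compiled, never executed, so `df` does not need to exist for the comparison.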
Bruck1701
    As per [this answer](https://stackoverflow.com/a/72895404/3109189), the above is wrong: `del df` and `del [df]` compile to the exact same bytecode. – PLNech Jul 07 '22 at 10:00