I swear I saw this discussion somewhere some time ago but I cannot find this anywhere anymore.
Imagine I have this method:
import numpy as np
import pandas as pd

def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df[df['val'] == 1]
I stopped writing it this way some time ago because the method could return a view instead of a new dataframe (this is not a certainty; it depends on what pandas decides to do).
The issue, as I read it, is that if a view is returned, the refcount of the original dataframe is not reduced, because the view still references that old dataframe even though we are only using a small portion of its data.
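One way to test the "view" worry directly, as a minimal sketch: `np.shares_memory` reports whether two arrays overlap in memory, so it can be applied to the `val` column of the original frame and of the filtered result. The helper name `mask_result_shares_memory` is made up for illustration, and the check assumes `to_numpy()` hands back the underlying buffer of an integer column without copying it.

```python
import numpy as np
import pandas as pd


def mask_result_shares_memory(n_rows=1_000_000):
    """Return True if the boolean-mask result overlaps the original buffer."""
    # Hypothetical helper: same data as in my_method, but we keep both frames
    # alive long enough to compare their underlying arrays.
    df = pd.DataFrame({"val": np.random.randint(0, 1000, n_rows)})
    result = df[df["val"] == 1]
    # Compare the raw NumPy buffers behind the two 'val' columns.
    return bool(np.shares_memory(df["val"].to_numpy(), result["val"].to_numpy()))


shares = mask_result_shares_memory()
print("filtered result shares memory with original:", shares)
```

The same comparison could be run on the result of any other filtering approach to see whether it behaves differently.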
I was advised to instead do the following:
def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df["val"] != 1].index)
In this case, the drop method creates a new dataframe containing only the data we want to keep, and as soon as the method finishes, the refcount of the original dataframe would drop to zero, making it eligible for garbage collection and eventually freeing up the memory.
In summary, this would be much more memory friendly and would also ensure that the result of the method is a dataframe and not a view of a dataframe, which can lead to the SettingWithCopyWarning we all love.
Is this still true? Or is it something I misread somewhere? I have tried to check whether this has some benefit on memory usage but given that I cannot control when the gc decides to "remove" things from memory, just ask it to collect stuff... I never seem to have any conclusive results.
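For what it is worth, here is a rough sketch of a check that does not depend on watching process memory: a `weakref.ref` to the original dataframe becomes `None` once nothing holds a strong reference to it any more. The function name `filter_and_track` is invented for this example, and the sketch assumes CPython's reference counting; it only shows whether the original object is still alive, not how much memory the operating system gets back.

```python
import gc
import weakref

import numpy as np
import pandas as pd


def filter_and_track():
    """Filter with a boolean mask and return a weak reference to the original."""
    # Hypothetical helper mirroring my_method, plus a weakref for inspection.
    df = pd.DataFrame({"val": np.random.randint(0, 1000, 1_000_000)})
    original_ref = weakref.ref(df)  # does not keep df alive
    result = df[df["val"] == 1]
    return result, original_ref


result, original_ref = filter_and_track()
gc.collect()  # clear any leftover reference cycles before checking

# If the result kept the original frame alive, the weakref would still resolve.
original_freed = original_ref() is None
print("original dataframe released:", original_freed)
```

Swapping the mask line for the `drop` version and comparing the two outputs would show whether the two approaches actually differ in what they keep alive.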