
I swear I saw this discussion somewhere some time ago but I cannot find this anywhere anymore.

Imagine I have this method:

import numpy as np
import pandas as pd

def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df[df['val'] == 1]

I stopped doing this some time ago because the method could return a view (this is not certain; it depends on what pandas decides to do) instead of a new dataframe.

The issue with this, I read, is that if a view is returned, the refcount of the original dataframe is not reduced, because the view is still referencing that old dataframe even though we are only using a small portion of the data.

I was advised to instead do the following:

def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df["val"] != 1].index)

In this case, the drop method creates a new dataframe containing only the data we want to keep, and as soon as the method finishes the refcount of the original dataframe would drop to zero, making it eligible for garbage collection and eventually freeing up the memory.

In summary, this would be much more memory friendly and would also ensure that the result of the method is a dataframe and not a view of a dataframe, which can lead to the SettingWithCopyWarning we all love.

Is this still true? Or is it something I misread somewhere? I have tried to check whether this has some benefit on memory usage but given that I cannot control when the gc decides to "remove" things from memory, just ask it to collect stuff... I never seem to have any conclusive results.

iacob
Juanpe Araque

3 Answers

1

You can always use the df.query() method with inplace=True, which applies the filter to the original dataset so you don't need to create a second one.

Code:

def my_method_3(df):
    # With inplace=True, query() filters df itself and returns None
    df.query('val == 1', inplace=True)
    return df

my_method_3(df)

Also the method:

def my_method():
    df = pd.DataFrame({'val': np.random.randint(0, 1000, 1000000)})
    return df.drop(df[df["val"] != 1].index)

might not be very efficient for large datasets. When I benchmarked this method, I got: CPU times: user 327 ms, sys: 51.4 ms, total: 379 ms; wall time: 394 ms.

In contrast, the df.query method took: CPU times: user 14.3 ms, sys: 7.39 ms, total: 21.7 ms; wall time: 18.6 ms.
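For anyone who wants to repeat the comparison, here is a minimal sketch using timeit. The helper names (with_drop, with_query, with_mask) and the fixed seed are my own assumptions, and query is called without inplace=True here so that the same dataframe can be timed repeatedly. Absolute numbers will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Fixed seed so repeated runs filter the same data (an assumption for reproducibility)
rng = np.random.default_rng(0)
df = pd.DataFrame({"val": rng.integers(0, 1000, 1_000_000)})


def with_drop(frame):
    # Drop every row whose value is not 1
    return frame.drop(frame[frame["val"] != 1].index)


def with_query(frame):
    # Non-inplace query, so the input frame is left untouched
    return frame.query("val == 1")


def with_mask(frame):
    # Boolean mask followed by an explicit copy
    return frame[frame["val"] == 1].copy()


for func in (with_drop, with_query, with_mask):
    seconds = timeit.timeit(lambda: func(df), number=5)
    print(f"{func.__name__}: {seconds / 5 * 1000:.1f} ms per call")

# Sanity check: all three approaches should select exactly the same rows
same_rows = with_drop(df).equals(with_query(df)) and with_drop(df).equals(with_mask(df))
```

Checking same_rows before reading the timings confirms that the three variants are actually doing the same work.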

Jofre
aninda
    the `inplace` argument will soon be deprecated as it is not always 100% understood what happens under the hood: https://github.com/pandas-dev/pandas/issues/16529 and https://stackoverflow.com/questions/43893457/understanding-inplace-true – LazyEval Jun 28 '21 at 14:12
0

If you want to avoid returning a view, simply change the return statement from df[mask] to df[mask].copy().
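A small sketch of what that looks like, with a check that the result does not share memory with the original. The variable names and the fixed seed are assumptions made for illustration, and np.shares_memory is used here only as a rough probe of the underlying buffers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"val": rng.integers(0, 1000, 1_000_000)})

mask = df["val"] == 1
subset = df[mask].copy()  # explicit copy: subset owns its own data

# Probe whether the two frames point at overlapping buffers
shares = np.shares_memory(subset["val"].to_numpy(), df["val"].to_numpy())

# Writing to the copy should leave the original untouched
matches_before = int(mask.sum())
subset["val"] = 0
matches_after = int((df["val"] == 1).sum())
```

If shares is False and matches_before equals matches_after, the subset is independent of the original dataframe.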

iacob
0

Using the `drop` method is also not a good idea, as it is much slower because it has to work on the entire table in memory. It is best to select what you need and then return a copy of that subset, as @iacob says, using `df[df['val'] == 1].copy()`. This is 20% faster than the query method and avoids the deprecation issue.

The SettingWithCopy warning is a result of chaining, which you aren't doing in this case (see here), but it doesn't make sense to return a view on a DataFrame you have no use for, and hence .copy() would be better practice.

Regarding memory usage, there should be no memory issue when using a copy, but when in doubt you can use `del df` to clear the original, for a very small time penalty (still faster than query).
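If you want something more conclusive than watching process memory, one option is to keep a weak reference to the original dataframe and see whether it is still alive after the function returns. This is only a sketch: the function name my_method_checked and the probe variable are hypothetical, and returning the weak reference is done purely for the experiment.

```python
import gc
import weakref

import numpy as np
import pandas as pd


def my_method_checked():
    df = pd.DataFrame({"val": np.random.randint(0, 1000, 1_000_000)})
    probe = weakref.ref(df)  # does not keep df alive
    result = df[df["val"] == 1].copy()
    del df  # drop the only strong reference to the original
    return result, probe


result, probe = my_method_checked()
gc.collect()  # also clear any reference cycles before checking

# probe() returns None once the original dataframe has been released
original_released = probe() is None
print("original released:", original_released)
```

If original_released is True, nothing in the returned result is keeping the large dataframe alive.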

Talha Tayyab
Daniel Redgate