I am experiencing strange pandas behaviour with the following code:

import numpy as np
import pandas as pd

def info(df):
    print(f"whole df: {hex(id(df))}")
    print(f"col a   : {hex(id(df['a']))}")
    print(f"col b   : {hex(id(df['b']))}")
    print(f"col c   : {hex(id(df['c']))}")

def _drop(col):
    print(f"called on  : {col.name}")
    print(f"before drop: {hex(id(col))}")
    col[0] = -1    
    col.dropna(inplace=True)
    col[0] = 1
    print(f"after drop : {hex(id(col))}")   


df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

info(df)
df.apply(_drop)
info(df)

If I comment out the dropna() line, or call dropna(inplace=False), I get the result I expected (because dropna creates a copy and I am modifying the original series):

     a    b    c
 0  1.0  1.0  1.0
 1  5.8  NaN  NaN

But with dropna(inplace=True) the operation should be done in place, modifying the original series. Instead, the result I get is:

     a    b    c
 0 -1.0 -1.0 -1.0
 1  5.8  NaN  NaN

I would expect the result to be the same as in the previous cases. Is dropna returning a clone even though the operation is in place? I am using pandas version 0.23.1.

Edit: Based on the answers provided, I added hex(id()) calls to check the actual instances. The code above printed the following (the values may differ for you, but which of them are equal should be the same):

whole df   : 0x1f482392f28
col a      : 0x1f482392f60
col b      : 0x1f48452af98
col c      : 0x1f48452ada0
called on  : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : b
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on  : b
before drop: 0x1f4ffef1ef0
after drop : 0x1f4ffef1ef0
called on  : c
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
whole df   : 0x1f482392f28
col a      : 0x1f482392f60
col b      : 0x1f48452af98
col c      : 0x1f48452ada0

It is weird that the function is called twice on both columns a and b, whereas the docs say it is called twice only on the first column.

Additionally, the hex value for the second pass over column b is different. Neither of these happens when the col.dropna() call is omitted.

The hex values suggest that .apply() creates a new copy of each column, but how it propagates the values back to the original df is unknown to me.
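A note on reading the id() values above: id() identifies a Python object, not the data buffer behind it, and CPython may reuse an id once the earlier object has been freed. A more direct check of whether two series are backed by the same data is np.shares_memory. The snippet below is only a minimal sketch on a standalone series (the names s and independent are made up for illustration); it does not reproduce what .apply() does internally.

```python
import numpy as np
import pandas as pd

# A standalone series that mirrors column 'a' of the question.
s = pd.Series([np.nan, 5.8], name='a')

# Two array views of the same series are backed by the same buffer.
same_buffer = np.shares_memory(s.to_numpy(), s.to_numpy())

# An explicit copy gets its own buffer, so writes to it cannot reach s.
independent = s.copy()
copy_shares = np.shares_memory(s.to_numpy(), independent.to_numpy())

print(f"views share memory: {same_buffer}")
print(f"copy shares memory: {copy_shares}")
```

The same check could be applied inside _drop() by comparing col.to_numpy() with df[col.name].to_numpy(), although the outcome there is likely to depend on the pandas version.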

Lukáš N.

2 Answers

I tried to reason through this in terms of variable scope. I wouldn't consider it a full answer, but maybe it will be insightful for someone else.

When .apply executes, each series is passed in as the col argument. Inside the scope of _drop(), the line col[0] = -1 changes the "first row" of the df globally and therefore mutates it. When dropna() is called with inplace=True, the NaNs are actually dropped, but ONLY for the series inside the scope of that function; the result is not assigned back to the global df, even though it overwrites the variable col. Another observation: the docs say that .dropna(inplace=True) returns None, and _drop() also returns None since it has no return statement.
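The point about return values can be checked in isolation. The following is a small sketch, using a hypothetical function no_return that deliberately avoids mutating its argument, to show what .apply() hands back when the function has no return statement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

def no_return(col):
    # The cleaned series is bound to a local name only and never returned,
    # so the call evaluates to None for every column.
    cleaned = col.dropna()

result = df.apply(no_return)

print(type(result).__name__)
print(result)
```

Each column contributes one None, so the result collapses to a Series indexed by column name rather than a DataFrame. Whatever happens to df in the question therefore has to come from mutation inside the function, not from its return value.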

AnotherLazyPeon
  • The line `col[0] = -1` does not globally change the first row of the df, only the first row of the `col` series (that is, one specific element in the df). I updated my code, and you can see that the variable `col` is not modified after the drop. Lastly, I don't see any connection with the return type: both `.dropna(inplace=True)` and `_drop()` modify the object in place, so they do not need a return value. – Lukáš N. Jul 16 '18 at 21:11

It might be worth raising this issue on the pandas / numpy GitHub; to me, this looks like unexpected behavior. If you add a return col statement to the function, your code works as expected. This indicates that a local copy is indeed created; print(hex(id(col))) confirms this.

import numpy as np
import pandas as pd

def _drop(col):
    col[0] = -1
    col.dropna(inplace=True)
    col[0] = 1
    return col  # <---- the only change: return the modified series

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

df.apply(_drop)
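
As a variation on the idea above, here is a hedged sketch that avoids mutating the series passed in by .apply() altogether. The pandas documentation for DataFrame.apply notes that functions which mutate the passed object are not supported, so this version works on an explicit copy and returns it. The names fill_first and result are made up for illustration, and the steps simply mirror those in the question.

```python
import numpy as np
import pandas as pd

def fill_first(col):
    # Work on an explicit copy so the caller's data is never touched.
    out = col.copy()
    out.loc[0] = -1.0
    out = out.dropna()   # returns a new series; nothing is done in place
    out.loc[0] = 1.0
    return out

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])

# apply() builds a new DataFrame from the returned series, so keep the result.
result = df.apply(fill_first)

print(result)
print(df)
```

Because the returned series can have different lengths after dropna(), the rows are aligned by index label and any missing positions are filled with NaN. The original df is left as it was, so the outcome no longer depends on how .apply() passes the columns in.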
Thomas
  • Yes, this approach works; however, the line `df.apply(_drop)` creates a new dataframe. The original `df` still contains -1, as in my question. That might be the reason why the id is different. – Lukáš N. Jul 16 '18 at 20:47
  • As mentioned, I recommend raising an issue on the [pandas github page](https://github.com/pandas-dev/pandas). Even if this is not a bug, it is definitely unexpected behavior. – Thomas Jul 17 '18 at 08:32
  • If you do raise an issue there, please post the link here, I would be interested in the outcome as well. – Thomas Jul 17 '18 at 12:35