I am experiencing strange pandas behaviour. In the following code
import numpy as np
import pandas as pd


def info(df):
    # Print the identity of the frame and of each column object.
    print(f"whole df: {hex(id(df))}")
    print(f"col a : {hex(id(df['a']))}")
    print(f"col b : {hex(id(df['b']))}")
    print(f"col c : {hex(id(df['c']))}")


def _drop(col):
    # Called by df.apply once per column (col is a Series).
    print(f"called on : {col.name}")
    print(f"before drop: {hex(id(col))}")
    col[0] = -1
    col.dropna(inplace=True)
    col[0] = 1
    print(f"after drop : {hex(id(col))}")


df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])
info(df)
df.apply(_drop)
info(df)
if I comment out the dropna() line, or call dropna(inplace=False), I get the result I expected (because dropna creates a copy and I am modifying the original series):
     a    b    c
0  1.0  1.0  1.0
1  5.8  NaN  NaN
But with dropna(inplace=True) the operation should be done in place, thus modifying the original series, yet the result I get is:
     a    b    c
0 -1.0 -1.0 -1.0
1  5.8  NaN  NaN
However, I would expect the result to be the same as in the previous cases. Is the dropna operation returning a clone even though it is called with inplace=True?
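For reference, here is a minimal sketch of what I see on a standalone Series (one that is not handed to me by apply). The Series `s` and the name `dropped` are just illustrative; this only shows the two call styles side by side and does not explain what happens inside apply.

```python
import numpy as np
import pandas as pd

# Standalone Series, not taken from a DataFrame (illustrative data).
s = pd.Series([1.2, np.nan], name="b")

# Default call: a new Series is returned, the original keeps its NaN.
dropped = s.dropna()
print(dropped is s)          # False
print(len(s), len(dropped))  # 2 1

# In-place call: the same object `s` now holds the shortened data.
s.dropna(inplace=True)
print(len(s))                # 1
print(list(s.index))         # [0]
```

So on its own, the in-place call does change the object I hold, which is why I expected the later assignment to show up in the frame.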
I am using pandas version 0.23.1.
Edit:
Based on the provided answers, I added hex(id()) calls to verify the actual instances. The above code printed this (the values may differ for you, but the equalities between them should be the same):
whole df : 0x1f482392f28
col a : 0x1f482392f60
col b : 0x1f48452af98
col c : 0x1f48452ada0
called on : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : b
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : a
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
called on : b
before drop: 0x1f4ffef1ef0
after drop : 0x1f4ffef1ef0
called on : c
before drop: 0x1f480dcc2e8
after drop : 0x1f480dcc2e8
whole df : 0x1f482392f28
col a : 0x1f482392f60
col b : 0x1f48452af98
col c : 0x1f48452ada0
It is weird that the function is called twice on columns a and b, although the docs say it is called twice only on the first column. Additionally, the hex value for the second pass of column b is different. Neither happens when the col.dropna() call is omitted.
The hex values suggest that .apply() creates a new copy of the columns; however, how it propagates the values back to the original df is unknown to me.
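For comparison, a sketch of a variant that does not depend on how apply propagates in-place changes: the function works on an explicit copy and returns it, and I use the return value of apply. The helper name `set_first` is hypothetical, and this sidesteps the question rather than answering it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1.2, np.nan],
                   [5.8, np.nan, np.nan]], columns=['a', 'b', 'c'])


def set_first(col):
    # Hypothetical helper: modify an explicit copy and return it,
    # so the result does not rely on mutating the object apply passes in.
    out = col.copy()
    out.iloc[0] = 1
    return out


# apply assembles the returned Series into a new DataFrame.
result = df.apply(set_first)
print(result)
```

This gives me the frame I was expecting in the first place, and df itself is left untouched.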