
I have two DataFrames and I want to perform the same list of cleaning operations on both. I realized I could merge them into one and do everything in a single pass, but I am still curious why this method is not working:

import numpy as np
import pandas as pd

test_1 = pd.DataFrame({
    "A": [1, 8, 5, 6, 0],
    "B": [15, 49, 34, 44, 63]
})
test_2 = pd.DataFrame({
    "A": [np.nan, 3, 6, 4, 9, 0],
    "B": [-100, 100, 200, 300, 400, 500]
})

Let's assume I only want to keep the rows without NaNs. I tried

for df in [test_1, test_2]:
    df = df[pd.notnull(df["A"])]

but test_2 is left untouched. On the other hand, if I do:

test_2 = test_2[pd.notnull(test_2["A"])]

Now the first row goes away as expected.

meto
  • You're assigning the output to `df`, the loop variable. The underlying dataframes are unaltered. – pault Apr 23 '18 at 17:34
  • Related: [Do Python for loops work by reference?](https://stackoverflow.com/questions/14814771/do-python-for-loops-work-by-reference) – jpp Apr 23 '18 at 17:44

3 Answers


All these slicing/indexing operations create views/copies of the original DataFrame, and you then reassign the loop variable df to those views/copies, so the originals are never touched at all.
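A minimal sketch of the rebinding (the id calls are purely illustrative; id shows which object a name currently points to):

for df in [test_1, test_2]:
    print(id(df))                  # id of the original DataFrame
    df = df[pd.notnull(df["A"])]   # builds a NEW object and rebinds df to it
    print(id(df))                  # a different id: df no longer names the original
# test_1 and test_2 still reference the unfiltered originals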

Option 1
dropna(...inplace=True)
Try an in-place dropna call; this modifies the original objects in place:

df_list = [test_1, test_2]
for df in df_list:
    df.dropna(subset=['A'], inplace=True)  
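
A quick check confirms the originals have been mutated in place (output abbreviated):

print(test_2)
#      A    B
# 1  3.0  100
# ...
# 5  0.0  500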

Note: this is one of the few times I will ever recommend an in-place modification, because of this use case in particular.


Option 2
enumerate with reassignment
Alternatively, you may re-assign the result back into the list:

for i, df in enumerate(df_list):
    df_list[i] = df.dropna(subset=['A'])  # df_list[i] = df[df.A.notnull()]
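
One caveat with this approach (assuming you run it on fresh, unmodified frames): the list now holds the filtered frames, but the names test_1 and test_2 still point to the original, unfiltered objects.

print(test_2.shape)               # (6, 2): the original object is unchanged
cleaned_1, cleaned_2 = df_list    # unpack the new frames (illustrative names)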
cs95
  • I feel like there needs to be a canonical Q/A about "why isn't my DataFrame changed after operation x" – pault Apr 23 '18 at 17:40
  • @pault yeah, this is one of those questions that keeps getting asked frequently, but under such unassuming titles that there's no way to find them in a reasonable amount of time! I'll bookmark this question and keep it in my list of targets from now on though :) – cs95 Apr 23 '18 at 17:43
  • @pault Not specific to pandas but I think we have a canonical answer for this issue [here](https://stackoverflow.com/questions/14814771/do-python-for-loops-work-by-reference). – ayhan Apr 23 '18 at 17:43
  • in the answer @cᴏʟᴅsᴘᴇᴇᴅ you say `view`. that's what I thought too, that the var was just a view on the underlying `df` and that i was modifying it – meto Apr 25 '18 at 22:53

You are modifying copies of the dataframes rather than the original dataframes.

One way to deal with this issue is to use a dictionary. As a convenience, you can use pd.DataFrame.pipe together with a dictionary comprehension to apply the cleaning function across the dictionary.

def remove_nulls(df):
    return df[df['A'].notnull()]

dfs = dict(enumerate([test_1, test_2]))
dfs = {k: v.pipe(remove_nulls) for k, v in dfs.items()}

print(dfs)

# {0:    A   B
#     0  1  15
#     1  8  49
#     2  5  34
#     3  6  44
#     4  0  63,
#  1:      A    B
#     1  3.0  100
#     2  6.0  200
#     3  4.0  300
#     4  9.0  400
#     5  0.0  500}

Note: in the result, dfs[1]['A'] remains float. This is because np.nan is itself a float, and nothing has triggered a conversion back to int.
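If integer dtype is wanted once the NaNs are gone, an explicit cast restores it; a sketch (pandas' nullable Int64 dtype would be an alternative):

dfs = {k: v.pipe(remove_nulls).astype({'A': int}) for k, v in dfs.items()}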

jpp

By using pd.concat: concatenate the two frames with keys, drop the NaNs in one pass, then split back on the outer level of the index:

[x.reset_index(level=0, drop=True) for _, x in pd.concat([test_1, test_2], keys=[0, 1]).dropna().groupby(level=0)]
Out[376]: 
[     A   B
 0  1.0  15
 1  8.0  49
 2  5.0  34
 3  6.0  44
 4  0.0  63,      A    B
 1  3.0  100
 2  6.0  200
 3  4.0  300
 4  9.0  400
 5  0.0  500]
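
Note that both frames come back with a float A column: pd.concat combines the two A columns into one, and the NaN in test_2 forces float64 for the combined column. If you want the original names rebound to the cleaned frames, you can unpack the list; a sketch, with an optional cast back to int:

test_1, test_2 = [x.reset_index(level=0, drop=True)
                  for _, x in pd.concat([test_1, test_2], keys=[0, 1]).dropna().groupby(level=0)]
test_1['A'] = test_1['A'].astype(int)  # optional: restore integer dtype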
BENY