
Let's say we want to replace values in A_1 and A_2 according to a mask in B_1 and B_2. For example, replace those values in A (with 999) that correspond to nulls in B.

The original dataframe:

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2    5    n  NaN
2    3    6  NaN  NaN

The desired dataframe:

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN

The code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})

_mask = df[['B_1', 'B_2']].notnull()
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)

But I get the following instead:

   A_1  A_2  B_1  B_2
0  999  999    y    n
1  999  999    n  NaN
2  999  999  NaN  NaN

So, all of the values are replaced, not just the ones corresponding to nulls in B. How can I fix this?

I was following this answer, but it only works if I use the same columns:

df[['B_1', 'B_2']] = df[['B_1','B_2']].where(_mask, other=999)

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2    5    n  999
2    3    6  999  999
– cs95, toto_tico

3 Answers


There's an index alignment issue (on the column names): pandas aligns the mask's columns against the target's columns before applying it. Convert the mask to a NumPy array, and you're good to go:

df[['A_1', 'A_2']] = df.filter(like='A').mask(
    df.filter(like='B').isnull().values, 999
)

df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN
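The failure mode is worth seeing directly: `where` aligns a DataFrame condition on its labels before applying it, so a mask whose columns are named `B_1`/`B_2` never lines up with an `A_1`/`A_2` target, and every cell gets replaced. A minimal sketch reproducing the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})

mask = df[['B_1', 'B_2']].notnull()

# Misaligned labels: the condition is reindexed to A_1/A_2, comes back
# all-missing, missing is treated as False, and everything is replaced.
bad = df[['A_1', 'A_2']].where(mask, other=999)

# Passing the raw boolean ndarray skips label alignment entirely.
good = df[['A_1', 'A_2']].where(mask.values, other=999)

print(bad)    # every cell is 999
print(good)   # only the cells opposite nulls in B are 999
```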

Alternatively, keep this in numpy space:

df[['A_1', 'A_2']] = np.where(
    df.filter(like='B').isnull(), 999, df.filter(like='A')
)

df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN
– cs95
  • If you are wondering how @coldspeed managed to use [`isnull`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) instead of `notnull`: it is because [`mask()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html#pandas.DataFrame.mask) is more intuitive than [`where()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) in this case. [`np.where()`](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.where.html) is a different story. – toto_tico May 03 '18 at 16:01
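As the naming suggests, `mask` and `where` are inverses of each other: `where` keeps values where the condition is True and replaces the rest, while `mask` replaces values where the condition is True. A minimal illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
cond = s > 1  # [False, True, True]

# where: keep where True, replace where False
print(s.where(cond, 0).tolist())  # [0, 2, 3]

# mask: replace where True, keep where False
print(s.mask(cond, 0).tolist())   # [1, 0, 0]
```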

Your code works with np.where:

df[['A_1', 'A_2']] = np.where(_mask, df[['A_1', 'A_2']], 999)
>>> df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN

This works because `np.where` is a NumPy-level function: it operates on the underlying arrays and is unaware of column names in any case.
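Note that `np.where` returns a plain ndarray, so the labels of both arguments are discarded and the assignment back into `df[['A_1', 'A_2']]` relies purely on position. A quick check with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull()

out = np.where(_mask, df[['A_1', 'A_2']], 999)
print(type(out))  # <class 'numpy.ndarray'> -- no index or columns survive
print(out)
```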

– Ami Tavory

So the problem is that the columns are named differently in `_mask` than in the target. One option is to rename the mask's columns so they match the target columns:

_mask = df[['B_1', 'B_2']].notnull()
_mask.columns = ['A_1', 'A_2']
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)

Or, easier, just use the `values` attribute so alignment is skipped:

_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
– toto_tico