
Let's say we want to replace values in A_1 and A_2 according to a mask in B_1 and B_2. For example, replace those values in A (with 999) that correspond to nulls in B.

The original dataframe:

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2    5    n  NaN
2    3    6  NaN  NaN

The desired dataframe:

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN

The code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})

_mask = df[['B_1', 'B_2']].notnull()
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)

But I get the following instead:

   A_1  A_2  B_1  B_2
0  999  999    y    n
1  999  999    n  NaN
2  999  999  NaN  NaN

So, all of the values are replaced, not just the ones corresponding to nulls in B. How can I fix this?

I was following this answer, but it only works if I use the same columns:

df[['B_1', 'B_2']] = df[['B_1','B_2']].where(_mask, other=999)

   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2    5    n  999
2    3    6  999  999
– cs95, toto_tico

3 Answers


There's an index alignment issue (on the column names): pandas aligns the mask's columns against the target's columns before applying it. Convert the mask to a NumPy array, and you're good to go:

df[['A_1', 'A_2']] = df.filter(like='A').mask(
    df.filter(like='B').isnull().values, 999
)

df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN
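The failure mode is worth seeing directly: `where` aligns a DataFrame condition on its labels before applying it, so a mask whose columns are named `B_1`/`B_2` never lines up with an `A_1`/`A_2` target, and every cell gets replaced. A minimal sketch reproducing the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})

mask = df[['B_1', 'B_2']].notnull()

# Misaligned labels: the condition is reindexed to A_1/A_2, comes back
# all-missing, missing is treated as False, and everything is replaced.
bad = df[['A_1', 'A_2']].where(mask, other=999)

# Passing the raw boolean ndarray skips label alignment entirely.
good = df[['A_1', 'A_2']].where(mask.values, other=999)

print(bad)    # every cell is 999
print(good)   # only the cells opposite nulls in B are 999
```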

Alternatively, keep this in numpy space:

df[['A_1', 'A_2']] = np.where(
    df.filter(like='B').isnull(), 999, df.filter(like='A')
)

df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN
– cs95
  • If you are wondering how @coldspeed managed to use [`isnull`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) instead of `notnull`: it is because [`mask()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html#pandas.DataFrame.mask) is more intuitive than [`where()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) in this case. [`np.where()`](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.where.html) is a different story. – toto_tico May 03 '18 at 16:01
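As the naming suggests, `mask` and `where` are inverses of each other: `where` keeps values where the condition is True and replaces the rest, while `mask` replaces values where the condition is True. A minimal illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
cond = s > 1  # [False, True, True]

# where: keep where True, replace where False
print(s.where(cond, 0).tolist())  # [0, 2, 3]

# mask: replace where True, keep where False
print(s.mask(cond, 0).tolist())   # [1, 0, 0]
```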

Your code works with np.where:

df[['A_1', 'A_2']] = np.where(_mask, df[['A_1', 'A_2']], 999)
>>> df
   A_1  A_2  B_1  B_2
0    1    4    y    n
1    2  999    n  NaN
2  999  999  NaN  NaN

This works because `np.where` is a NumPy-level function: it operates on the underlying arrays and is unaware of column names in any case.
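Note that `np.where` returns a plain ndarray, so the labels of both arguments are discarded and the assignment back into `df[['A_1', 'A_2']]` relies purely on position. A quick check with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull()

out = np.where(_mask, df[['A_1', 'A_2']], 999)
print(type(out))  # <class 'numpy.ndarray'> -- no index or columns survive
print(out)
```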

– Ami Tavory

So the problem is that the columns are named differently in `_mask` than in the target. One option is to rename the mask's columns so they match the target columns:

_mask = df[['B_1', 'B_2']].notnull()
_mask.columns = ['A_1', 'A_2']
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)

Or, easier, just use the `values` attribute so alignment is skipped:

_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
– toto_tico