1

I have a data frame with:

customer_id [1,2,3,4,5,6,7,8,9,10]
feature1 [0,0,1,1,0,0,1,1,0,0]
feature2 [1,0,1,0,1,0,1,0,1,0]
feature3 [0,0,1,0,0,0,1,0,0,0]

Using this I want to create a new variable (say new_var) to say when feature 1 is 1 then the new_var=1, if feature_2=1 then new_var=2, feature3=1 then new_var=3 else 4. I was trying np.where but though it doesn't give me an error, it doesn't do the right thing - so I guess a nested np.where works on a single variable only. In which case, what's the best way to perform a nested if/case when in pandas?

My np.where code was something like this:

df[new_var]=np.where(df['feature1']==1,'1', np.where(df['feature2']==1,'2', np.where(df[feature3']==1,'3','4')))
Shraddha
  • 155
  • 3
  • 16
  • Just to sort of answer my own question: The np.where solution I have mentioned as something I tried also works - the reason it wasn't giving me the correct results was cos the datatype for feature1 was string and not integer.. so for anyone looking for similar questions, both the 'nested np.where' solution and the 'numpy.select' solution jezrael mentioned works – Shraddha Aug 10 '17 at 16:48

2 Answers2

1

I think you need numpy.select - it select first True values and all another are not important:

m1 = df['feature1']==1 
m2 = df['feature2']==1    
m3 = df['feature3']==1 
df['new_var'] = np.select([m1, m2, m3], ['1', '2', '3'], default='4')

Sample:

customer_id = [1,2,3,4,5,6,7,8,9,10]
feature1 = [0,0,1,1,0,0,1,1,0,0]
feature2 = [1,0,1,0,1,0,1,0,1,0]
feature3  = [0,0,1,0,0,0,1,0,0,0]

df = pd.DataFrame({'customer_id':customer_id,
                   'feature1':feature1,
                   'feature2':feature2,
                   'feature3':feature3})

m1 = df['feature1']==1 
m2 = df['feature2']==1    
m3 = df['feature3']==1 
df['new_var'] = np.select([m1, m2, m3], ['1', '2', '3'], default='4')
print (df)
   customer_id  feature1  feature2  feature3 new_var
0            1         0         1         0       2
1            2         0         0         0       4
2            3         1         1         1       1
3            4         1         0         0       1
4            5         0         1         0       2
5            6         0         0         0       4
6            7         1         1         1       1
7            8         1         0         0       1
8            9         0         1         0       2
9           10         0         0         0       4

If in features only 1 and 0 is possible convert 0 to False and 1 to True:

m1 = df['feature1'].astype(bool)
m2 = df['feature2'].astype(bool)
m3 = df['feature3'].astype(bool)
df['new_var'] = np.select([m1, m2, m3], ['1', '2', '3'], default='4')
print (df)
   customer_id  feature1  feature2  feature3 new_var
0            1         0         1         0       2
1            2         0         0         0       4
2            3         1         1         1       1
3            4         1         0         0       1
4            5         0         1         0       2
5            6         0         0         0       4
6            7         1         1         1       1
7            8         1         0         0       1
8            9         0         1         0       2
9           10         0         0         0       4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks @jezrael - seems to work well if I try on this example but not on my code and I am trying to figure out why. Also, this is a solution that will not work for cases when feature 1,2,3 are all 1 when it only takes the first value (row 3 for example). – Shraddha Aug 10 '17 at 12:00
  • 1
    Works now! I had the 0/1 as strings which is why it was returning the default 4 everytime. Thanks! – Shraddha Aug 10 '17 at 12:35
0

Try:

df['new_var']=np.where(df['feature3']==1, '3', '4')
df['new_var']=np.where(df['feature2']==1,'2', df['new_var'])
df['new_var']=np.where(df['feature1']==1, '1', df['new_var'])
Muhammad Rasel
  • 704
  • 4
  • 9