
I have the following problem:

Given a dataframe, say for example,

import pandas as pd
df = pd.DataFrame({'col1':[1,0,0,1],'col2':['B','B','A','A'],'col3':[1,2,3,4]})

In some other tool I can easily create a new column based on a condition, say

Create new column 'col3' with 'col2' if df['col1'] == '0' & ~df['col2'].isnull() else 'col1'

That other tool works this out pretty fast, but I have not found a corresponding expression in Python so far.

1.) I tried np.where, which works on whole columns, but I could not get it to put a row-specific value (like the value of col2 in that row) into the result.

2.) I've tried .apply(lambda ...), which appears to be quite slow; roughly what I tried is sketched below.
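The apply version looked roughly like this (a sketch using the example columns from above; the real lambda may differ slightly):

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1], 'col2': ['B', 'B', 'A', 'A'], 'col3': [1, 2, 3, 4]})

# row-wise: take col2 where col1 == 0 and col2 is not missing, otherwise keep col1
df['col3'] = df.apply(
    lambda row: row['col2'] if row['col1'] == 0 and pd.notnull(row['col2']) else row['col1'],
    axis=1
)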

I would be happy if you could suggest an elegant way to solve this problem. Thanks.

StephanH

3 Answers


I think you need numpy.where with notnull instead of the inverted isnull (thanks @jpp):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':[1,0,0,1],'col2':['B','B','A','A'],'col3':[1,2,3,4]})

# take col2 where col1 == 0 and col2 is not missing, otherwise keep col1
df['col3'] = np.where((df['col1'] == 0) & (df['col2'].notnull()), df['col2'], df['col1'])
print (df)
   col1 col2 col3
0     1    B    1
1     0    B    B
2     0    A    A
3     1    A    1
jezrael

Try this:

import numpy as np

# the comparison needs its own parentheses: & binds tighter than ==
df['new_col'] = np.where((df['col1'] == 0) & (~df['col2'].isnull()), df['col2'], df['col1'])

np.where is faster than pd.apply: Why is np.where faster than pd.apply
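If you want to check this on your own data, a rough comparison could look like the sketch below (the frame size is arbitrary and absolute timings depend on your machine):

import numpy as np
import pandas as pd
from timeit import timeit

# larger random frame so the difference is visible
big = pd.DataFrame({'col1': np.random.randint(0, 2, 100000),
                    'col2': np.random.choice(['A', 'B', None], 100000)})

t_where = timeit(lambda: np.where((big['col1'] == 0) & big['col2'].notnull(),
                                  big['col2'], big['col1']), number=10)
t_apply = timeit(lambda: big.apply(lambda r: r['col2'] if r['col1'] == 0 and pd.notnull(r['col2'])
                                   else r['col1'], axis=1), number=10)
print(t_where, t_apply)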

Matteo M

You can use df.loc:

# start from col1, then overwrite the rows where col1 == 0 and col2 is not null
df['col3'] = df['col1']
df.loc[(df['col1'] == 0) & (~df['col2'].isnull()), 'col3'] = df['col2']
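For the example frame above, this should give the same result as the np.where answer:

print(df)
   col1 col2 col3
0     1    B    1
1     0    B    B
2     0    A    A
3     1    A    1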
Andy