0

I have one dataframe, I'm trying to perform if function on the data, say if column A is 'ON' then Column E should be Col C + Col D otherwise Column E is MAX(col B, col C)-col C + col D).

df1:

T_ID   A     B    C     D
1      ON   100   90    0
2      OFF  150   120  -20
3      OFF  200   150   0
4      ON   400   320   0
5      ON   100    60  -10
6      ON   250   200   0

Resulting Data frame

T_ID   A     B    C     D    E
1      ON   100   90    0     90
2      OFF  150   120  -20    10
3      OFF  200   150   0     50
4      ON   400   320   0    320
5      ON   100    60  -10    50
6      ON   250   200   0    200

I am using the following code, any suggestion how can I do it in a better way?

condition = df1['A'].eq('ON')

df1['E'] = np.where(condition, df1['C'] + df1['D'], max(df1['B'],df1['C'])-df1['C']+df1['D'])
Shaido
  • 27,497
  • 23
  • 70
  • 73

2 Answers2

3

I think np.where here is good approach. For me working numpy.maximum, max raise error:

condition = df1['A'].eq('ON')

df1['E'] = np.where(condition, 
                    df1['C'] + df1['D'], 
                    np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)
   T_ID    A    B    C   D    E
0     1   ON  100   90   0   90
1     2  OFF  150  120 -20   10
2     3  OFF  200  150   0   50
3     4   ON  400  320   0  320
4     5   ON  100   60 -10   50
5     6   ON  250  200   0  200

df1['E'] = np.where(condition, 
                    df1['C'] + df1['D'], 
                    max(df1['B'],df1['C'])-df1['C']+df1['D'])
print (df1)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Here apply is worse solution, because loops under the hood, so slow:

#6k rows -> for sample data np.where is 265 times faster like apply
df1 = pd.concat([df1] * 1000, ignore_index=True)
print (df1)


In [73]: %%timeit
    ...: condition = df1['A'].eq('ON')
    ...: 
    ...: df1['E1'] = np.where(condition, 
    ...:                     df1['C'] + df1['D'], 
    ...:                     np.maximum(df1['B'],df1['C'])-df1['C']+df1['D'])
    ...:                     
1.91 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [74]: %%timeit
    ...: df1['E2'] = df1.apply(createE, axis=1)
    ...: 
507 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

I think apply function will be better solution. The code may like this:

def createE(row):
   if row.A == 'ON':
      return row.C + row.D
   else:
      return max(row.B, row.C) - row.C + row.D
df1['E'] = df1.apply(createE)

See more about apply at https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/

do thuan
  • 459
  • 3
  • 11
  • Yeah, I just google it, `np.where` is much more faster then `apply`. Here is the explanation https://stackoverflow.com/questions/41166348/why-is-np-where-faster-than-pd-apply – do thuan May 13 '20 at 06:08
  • hmmm, can you explain why do you think `apply function will be better solution.` ? – jezrael May 13 '20 at 06:10
  • Thanks for your suggestion, i tried using the code but getting the following error, can you please see AttributeError: 'Series' object has no attribute 'A' – Shashi Mistry May 13 '20 at 06:40