1

I am just starting to learn python so please ignore or let me know if I am not asking this questions properly.

I have a dataframe df

df = pd.DataFrame({'A' : [5,6,0,-4], 'B' : [1,2,3,5]})
A B
5 1
6 2
0 3
-4 5

I want to do the below:

IF A greater than 0 then DO;

----IF B < 2 then NEWVALUE = A * 2

----ELSE A = A/2

IF A = 0 then CALC_FIELD = NEWVALUE + 2 # use the NEWVALUE column that is just created in prior step

I am using the below combination of 'for' and 'if' to get the output. But is there a better and efficient way to do this ?

for i in range(df.shape[0]):
    if (df.loc[i, 'A'] > 0) & (df.loc[i, 'B'] < 2): df.loc[i,'NEWVALUE'] =  df.loc[i,'A'] * 2
    if (df.loc[i, 'A'] <= 0): df.loc[i, 'NEWVALUE'] = df.loc[i,'A'] / 2
    if (df.loc[i, 'A'] == 0): df.loc[i, 'CALC_FIELD'] = df.loc[i, 'NEWVALUE'] + 2
**EXPECTED OUTPUT**
|  A | B  | C  | D  |
| -- | -- | -- |--  |
| 5  | 1  | 10 | Nan|
| 6  | 2  | Nan| Nan| 
| 0  | 3  | 0  | 2  |
| -4 | 5  | -2 |Nan | 
sophocles
  • 13,593
  • 3
  • 14
  • 33
Deb
  • 499
  • 2
  • 15
  • is there a typo in your output? isn't the ```nan``` in your C column supposed to be 3? Since A(6) > 0 and B(2) is not less than 2, it should be 6/2 = 3? – sophocles Mar 16 '21 at 14:58
  • Why does NEWVALUE get populated for last row? `A` is **less than 0** there. – Sayandip Dutta Mar 16 '21 at 15:01

3 Answers3

2

I think there's a typo in your desired output.

I also wouldn't use a for loop for this kind of calculation. I would suggest using np.select if you had many conditions and you wanted to create 1 column, but since your aim is to create 2 columns, one simple way would be to use np.where twice:

import numpy as np
df['C']= np.where((df['A']>0) & (df['B']<2) ,df['A'].mul(2),df['A'].div(2))
df['D']= np.where(df['A'].eq(0),df['C'].add(2),np.nan)

Which gets:

df
   A  B     C    D
0  5  1  10.0  NaN
1  6  2   3.0  NaN
2  0  3   0.0  2.0
3 -4  5  -2.0  NaN
sophocles
  • 13,593
  • 3
  • 14
  • 33
2

Just a little to add on the good existing answer. I agree that you should try to avoid the for loop if possible, because the built-in vectorized operations will scale much better. See here for the no-nos of iteration over a DataFrame.

An alternative to np.where is to just use your same loc logic, but with boolean indexing*:

df.loc[(df['A'] > 0) & (df['B'] < 2), 'NEWVALUE'] = df['A'] * 2
df.loc[df['A'] <= 0, 'NEWVALUE'] = df['A'] / 2
df.loc[df['A'] == 0, 'CALC_FIELD'] = df['NEWVALUE'] + 2

* I agree with the other comments that you may want to double check that your logic is doing what you purport it to do; it wasn't totally clear in the post.

In this acute case, this solution is actually slightly slower than the for loop, interestingly enough (using %%timeit):

  • for loop: 1.12 ms ± 69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • boolean indexing: 2.41 ms ± 47.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, the real benefit is observed with more data. If you make your dataset only 10x bigger:

import pandas as pd

df = pd.DataFrame({'A' : [5,6,0,-4]*10, 'B' : [1,2,3,5]*10})

The vectorized indexing is much faster:

  • for loop: 10.8 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • boolean indexing: 2.37 ms ± 37.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The discrepancy grows with the size of the dataset.

Tom
  • 8,310
  • 2
  • 16
  • 36
0

I tried the same code snippet in jupyter notebook, Got the desired output.

import pandas as pd
df = pd.DataFrame({'A' : [5,6,0,-4], 'B' : [1,2,3,5]})
for i in range(df.shape[0]):
    if (df.loc[i, 'A'] > 0) & (df.loc[i, 'B'] < 2): df.loc[i,'C'] =  df.loc[i,'A'] * 2
    if (df.loc[i, 'A'] <= 0): df.loc[i, 'C'] = df.loc[i,'A'] / 2
    if (df.loc[i, 'A'] == 0): df.loc[i, 'D'] = df.loc[i, 'C'] + 2
df

which gives the expected output:

    A   B     C      D
0   5   1   10.0    NaN
1   6   2   NaN     NaN
2   0   3   0.0     2.0
3  -4   5  -2.0     NaN
Tom
  • 8,310
  • 2
  • 16
  • 36