Is there a fastest way to add a column to a dataset by making some calculations on other columns?

Question

I have written a code (see below) to make some calculations on the dataset and add the results as a column to it.

ratio_list = []

for s,p,f in zip(A["s"], A["p"], A["f"]):
    m = A[(A["s"]==s) & (A["p"]==p) & (A["f"]<f)][['a', 't']].product(axis=1).sum()
    n = A[(A["s"]==s) & (A["p"]==p) & (A["f"]<f)]['a'].sum()

    if(n==0):
        ratio_list.append(0)
    else:
        ratio_list.append(m/n)

A["ratio"] = ratio_list

Here, A is a pandas data frame; s, p, f, a, t are column names. I want to add a column ratio consisting of the results of some calculations as you can see in the code.

This codes takes 10 minutes to run in a jupyter notebook. I wonder if I can write in a different way so that it takes less time?

A sample data with the result as a column "ratio": (in csv) (couldn't add a file)

,s,p,f,a,t,ratio
0,101,2018,2018-01-06,2.0,10.0,13.0
1,101,2018,2018-01-06,2.0,12.0,13.0
2,101,2018,2018-01-03,4.0,14.0,0.0
3,101,2018,2018-01-03,16.0,12.0,0.0
4,101,2018,2018-01-03,12.0,14.0,0.0
5,101,2018,2018-01-06,4.0,10.0,13.0
6,101,2018,2018-01-06,14.0,23.0,13.0
7,101,2018,2018-01-08,4.0,10.0,15.222222222222221
8,101,2018,2018-01-08,20.0,14.0,15.222222222222221
9,101,2018,2018-01-08,21.0,23.0,15.222222222222221
10,101,2018,2018-01-08,21.0,23.0,15.222222222222221
11,101,2018,2018-01-09,4.0,8.0,17.566666666666666
12,101,2018,2018-01-09,10.0,14.0,17.566666666666666
13,101,2018,2018-01-13,13.0,23.0,17.01492537313433
14,101,2018,2018-01-13,9.0,23.0,17.01492537313433
15,103,2018,2018-01-31,20.0,15.0,0.0
16,103,2018,2018-01-31,2.0,15.0,0.0
17,103,2018,2018-01-31,20.0,15.0,0.0
18,103,2018,2018-01-31,20.0,15.0,0.0
19,103,2018,2018-01-31,20.0,15.0,0.0

@Shrey I would not say that the apply method is fast, see for example: https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code — baccandr, Oct 24 '19 at 12:30
agree, even i had recently removed it from my code. But there are certain use cases where apply works very well. Looking at the question here, is suppose df.apply might be the easiest solution for him — Shrey, Oct 24 '19 at 12:33
hard to test if there's no sample dataset. Create a snippet and someone might be able to help you. I personally think this can be faster if vectorized using `numpy` — Joe, Oct 24 '19 at 12:38

jezrael · Accepted Answer · 2019-10-24T12:51:33.713

Use this custom function per groups by s and p columns by GroupBy.apply and mainly numpy.where:

def ratio(x):
    #2d mask for compare each value
    ma = x['f'].values < x['f'].values[:, None]
    #for pandas 0.24+
    #ma = x['f'].to_numpy() < x['f'].to_numpy()[:, None]
    #fill a and t values by mask
    a = np.where(ma, x['a'], 0)
    t = np.where(ma, x['t'], 0)

    #multiple and sum per 'columns'
    m = (a * t).sum(axis=1)
    n = a.sum(axis=1)

    #set column by condition
    x['ratio1'] = np.where(n == 0, 0, m/n)
    return x


A = A.groupby(['s','p']).apply(ratio)

print (A)
      s     p           f     a     t      ratio     ratio1
0   101  2018  2018-01-06   2.0  10.0  13.000000  13.000000
1   101  2018  2018-01-06   2.0  12.0  13.000000  13.000000
2   101  2018  2018-01-03   4.0  14.0   0.000000   0.000000
3   101  2018  2018-01-03  16.0  12.0   0.000000   0.000000
4   101  2018  2018-01-03  12.0  14.0   0.000000   0.000000
5   101  2018  2018-01-06   4.0  10.0  13.000000  13.000000
6   101  2018  2018-01-06  14.0  23.0  13.000000  13.000000
7   101  2018  2018-01-08   4.0  10.0  15.222222  15.222222
8   101  2018  2018-01-08  20.0  14.0  15.222222  15.222222
9   101  2018  2018-01-08  21.0  23.0  15.222222  15.222222
10  101  2018  2018-01-08  21.0  23.0  15.222222  15.222222
11  101  2018  2018-01-09   4.0   8.0  17.566667  17.566667
12  101  2018  2018-01-09  10.0  14.0  17.566667  17.566667
13  101  2018  2018-01-13  13.0  23.0  17.014925  17.014925
14  101  2018  2018-01-13   9.0  23.0  17.014925  17.014925
15  103  2018  2018-01-31  20.0  15.0   0.000000   0.000000
16  103  2018  2018-01-31   2.0  15.0   0.000000   0.000000
17  103  2018  2018-01-31  20.0  15.0   0.000000   0.000000
18  103  2018  2018-01-31  20.0  15.0   0.000000   0.000000
19  103  2018  2018-01-31  20.0  15.0   0.000000   0.000000

Is there a fastest way to add a column to a dataset by making some calculations on other columns?

1 Answers1