Is there a faster way to fill in a new column in a dataframe based on exsiting ones than using loc in python?

Question

I have a dataframe that includes the total number of products made in a factory for every day but it is a cumulative field and not daily values. I am trying to calculate daily values by just subtracting every day's cumulative number from next day's number. Here is the code I'm using. I am using loc to make sure that it inserts the new values in the original dataframe. This way it takes 10 seconds for 1000 rows which is kind of long, since the original data is much larger. Wondering if there is a faster way.

before:

    date        sum     
0   2020-03-24  10  
1   2020-03-25  50  
2   2020-03-26  90  
3   2020-03-27  140 
4   2020-03-28  180

code:

for i in range(1, 1000):
    data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']

after:

    date        sum     daily_products
0   2020-03-24  10  
1   2020-03-25  50      40
2   2020-03-26  90      40
3   2020-03-27  140     50
4   2020-03-28  180     40

and the time it takes for 1000 rows:

Total runtime of the program is 9.468996286392212

What about `data['daily_products'] = data['sum'] - data['sum'].shift(-1) ` ? — keepAlive, Nov 18 '20 at 16:23

score 3 · Answer 1 · answered Nov 18 '20 at 16:24

3

Let use

data['daily_products'] = data['sum'].diff()

159 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

versus

for i in range(1, len(data)-1):
    data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']

716 µs ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and...

data['daily_products'] = data['sum'] - data['sum'].shift(-1)

305 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '20 at 16:24

Scott Boston

147,308
15
139
187

The comparison would have been even better with `.diff(-1)` actually. Some "frictions" may occur when not defaulting. – keepAlive Nov 18 '20 at 16:39
.diff(-1) provides a negative number not correct in my opinion. However, it doesn't affect the times much at all. Mabye 10 microseconds. – Scott Boston Nov 18 '20 at 17:26
Thanks guys. very helpful – molosoco Nov 24 '20 at 21:33

keepAlive · Answer 2 · 2020-11-18T16:42:32.073

To address your question specifically, something you can do is using .shift

data['daily_products'] = data['sum'] - data['sum'].shift(-1)

Or cf @Scott's answer, in your case with .diff(-1)

Note that sum is a method of pd.DataFrame, which means that this is somehow a reserved name you should not use. Indeed, using such name for your variable prevents you from getting it by doing data.sum. By opposition, you can do data.daily_products since this column name does not enter in conflict with pandas's name space. Once it has been defined though.

Is there a faster way to fill in a new column in a dataframe based on exsiting ones than using loc in python?

2 Answers2