0

I have a dataframe that includes the total number of products made in a factory for every day but it is a cumulative field and not daily values. I am trying to calculate daily values by just subtracting every day's cumulative number from next day's number. Here is the code I'm using. I am using loc to make sure that it inserts the new values in the original dataframe. This way it takes 10 seconds for 1000 rows which is kind of long, since the original data is much larger. Wondering if there is a faster way.

before:

    date        sum     
0   2020-03-24  10  
1   2020-03-25  50  
2   2020-03-26  90  
3   2020-03-27  140 
4   2020-03-28  180 

code:

for i in range(1, 1000):
    data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']

after:

    date        sum     daily_products
0   2020-03-24  10  
1   2020-03-25  50      40
2   2020-03-26  90      40
3   2020-03-27  140     50
4   2020-03-28  180     40

and the time it takes for 1000 rows:

Total runtime of the program is 9.468996286392212
Nick ODell
  • 15,465
  • 3
  • 32
  • 66
molosoco
  • 9
  • 1

2 Answers2

3

Let use

data['daily_products'] = data['sum'].diff()

159 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

versus

for i in range(1, len(data)-1):
    data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']

716 µs ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and...

data['daily_products'] = data['sum'] - data['sum'].shift(-1)

305 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
1

To address your question specifically, something you can do is using .shift

data['daily_products'] = data['sum'] - data['sum'].shift(-1)

Or cf @Scott's answer, in your case with .diff(-1)


Note that sum is a method of pd.DataFrame, which means that this is somehow a reserved name you should not use. Indeed, using such name for your variable prevents you from getting it by doing data.sum. By opposition, you can do data.daily_products since this column name does not enter in conflict with pandas's name space. Once it has been defined though.

keepAlive
  • 6,369
  • 5
  • 24
  • 39