
Say I have the following values:

                                   money_spent
time                 
2014-10-06 17:59:40.016000-04:00      1.832128
2014-10-06 17:59:41.771000-04:00      2.671048
2014-10-06 17:59:43.001000-04:00      2.019434
2014-10-06 17:59:44.792000-04:00      1.294051
2014-10-06 17:59:48.741000-04:00      0.867856

I am hoping to measure how much money is spent every 2 seconds. More specifically, for every timestamp in the output, I need to see the money spent within the last 2 seconds.

When I do:

df.resample('2S', how='last')

I get:

                                money_spent
time               
2014-10-06 17:59:40-04:00          2.671048
2014-10-06 17:59:42-04:00          2.019434
2014-10-06 17:59:44-04:00          1.294051
2014-10-06 17:59:46-04:00               NaN
2014-10-06 17:59:48-04:00          0.867856

which is not what I would expect. To start with, note that the first entry in the resampled df is 2.671048 at time 17:59:40, even though, according to the original dataframe, no money had been spent yet at that time. How is that possible?

– Amelio Vazquez-Reina

3 Answers


Try using how=np.sum:

df.resample('2S', how=np.sum, closed='left', label='right')
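
If you are on a newer pandas where the how= keyword has been removed, a sketch of the equivalent call, chaining .sum() instead:

df.resample('2s', closed='left', label='right').sum()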

Edit:

As for closed and label:

It means: aggregate over intervals that are closed on the left, and label each bin with the date from the right end of the interval. For 2-second bins an interval looks like [0, 2): it contains e.g. 1, 1.2, 1.5 and 1.9, but not 2. And from the docs:

closed : {‘right’, ‘left’} Which side of bin interval is closed

label : {‘right’, ‘left’} Which bin edge label to label bucket with
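
To make the difference concrete, here is a small runnable sketch built from the sample data in the question (using the modern .sum() chaining, since how= has since been removed from pandas):

import pandas as pd

idx = pd.to_datetime([
    '2014-10-06 17:59:40.016', '2014-10-06 17:59:41.771',
    '2014-10-06 17:59:43.001', '2014-10-06 17:59:44.792',
    '2014-10-06 17:59:48.741',
])
df = pd.DataFrame({'money_spent': [1.832128, 2.671048, 2.019434,
                                   1.294051, 0.867856]}, index=idx)

# closed='left': bins are [40, 42), [42, 44), ... so an event at exactly
# 17:59:42 would fall into the second bin, not the first.
print(df.resample('2s', closed='left', label='right').sum())

# closed='right': bins are (38, 40], (40, 42], ... so an event at exactly
# 17:59:42 would be counted in the earlier bin instead.
print(df.resample('2s', closed='right', label='right').sum())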

– Nasser Al-Wohaibi
  • Thanks! Why do I need `closed='left'`? I think I understand well what the other parameters do. – Amelio Vazquez-Reina Oct 08 '14 at 02:23
  • Edited as per your comment – Nasser Al-Wohaibi Oct 08 '14 at 02:45
  • Thanks Nasser. I should have clarified this. I had read the explanation of `closed` in the docs, but was not sure I understood what it meant in this context. I guess we can use it to specify whether the edges of the bins are included in the computation. In this particular case, we could have chosen closed='right' instead, correct? (since I didn't specify how to treat the edges of the bins). Or am I misunderstanding what `closed` means? – Amelio Vazquez-Reina Oct 08 '14 at 02:56
  • That depends on what is logical for your analysis. From my simple understanding of this parameter, and for your particular case, specifying closed='right' is still valid, however unusual in context, since it will exclude any value at the start of the interval and count the value at its end. So the logical choice is closed='left', e.g. one starts a sum from 00:00:00 to 00:00:01.9999, not from 00:00:00.00009 to 00:00:02 – Nasser Al-Wohaibi Oct 08 '14 at 22:31

You can add events to the frame so that each dollar spent leaves the system two seconds after it arrives; then all you need is a cumulative sum.

There is a chance that two events may overlap; in that case, after the very last step, you need to drop duplicate time indices, keeping the last value of money_spent for each duplicated value of time:

>>> df
                            money_spent
time                                   
2014-10-06 21:59:40.016000        1.832
2014-10-06 21:59:41.771000        2.671
2014-10-06 21:59:43.001000        2.019
2014-10-06 21:59:44.792000        1.294
2014-10-06 21:59:48.741000        0.868

>>> import numpy as np  # for np.timedelta64 below
>>> xdf = df.copy()   # make a copy of the original frame
>>> xdf['money_spent'] *= -1  # negate the value of `money_spent`
>>> xdf.index += np.timedelta64(2, 's')  # shift each timestamp forward by 2 seconds

Now concat with the original frame, sort the index, and take the cumulative sum:

>>> pd.concat([df, xdf]).sort_index().cumsum()
                            money_spent
2014-10-06 21:59:40.016000    1.832e+00
2014-10-06 21:59:41.771000    4.503e+00
2014-10-06 21:59:42.016000    2.671e+00
2014-10-06 21:59:43.001000    4.690e+00
2014-10-06 21:59:43.771000    2.019e+00
2014-10-06 21:59:44.792000    3.313e+00
2014-10-06 21:59:45.001000    1.294e+00
2014-10-06 21:59:46.792000   -4.441e-16
2014-10-06 21:59:48.741000    8.679e-01
2014-10-06 21:59:50.741000   -4.441e-16

There is a floating point precision error that shows up as very small values like -4.441e-16; otherwise the numbers look correct to me:

>>> _['money_spent'].round(15)
2014-10-06 21:59:40.016000    1.832
2014-10-06 21:59:41.771000    4.503
2014-10-06 21:59:42.016000    2.671
2014-10-06 21:59:43.001000    4.690
2014-10-06 21:59:43.771000    2.019
2014-10-06 21:59:44.792000    3.313
2014-10-06 21:59:45.001000    1.294
2014-10-06 21:59:46.792000   -0.000
2014-10-06 21:59:48.741000    0.868
2014-10-06 21:59:50.741000   -0.000
Name: money_spent, dtype: float64
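
The deduplication mentioned at the start is not shown above; a minimal sketch using Index.duplicated, assuming we keep the last cumulative value for each repeated timestamp:

# if an original event and a shifted "expiry" event share the exact
# same timestamp, keep only the last cumulative value for that time
out = pd.concat([df, xdf]).sort_index().cumsum()
out = out[~out.index.duplicated(keep='last')]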
– behzad.nouri

The reason the first element in your returned frame is 2.671048 is that you're using the 'last' method for how to resample your data: the first 2-second bin contains both 1.832128 and 2.671048, and 'last' keeps the latter. If you want the first resampled point to read 1.832128 instead, use the how='first' kwarg.
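
For reference, a sketch of the same calls in current pandas, where the how= keyword no longer exists:

df.resample('2s').first()   # first observation in each 2-second bin
df.resample('2s').last()    # last observation, i.e. what how='last' did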

– tyleha