Im Kinda new to Python and Datascience.
I have these two Dataframes : df Dataframe
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
Date A
0 2014-11-21 11:00:00 1
1 2014-11-21 11:00:03 2
2 2014-11-21 11:00:04 5
3 2014-11-21 11:00:05 3
4 2014-11-21 11:00:07 9
5 2014-11-21 11:00:08 6
6 2014-11-21 11:00:10 3
7 2014-11-21 11:00:11 0
8 2014-10-24 10:00:55 8
9 2014-10-24 10:00:59 10
info Dataframe, this dataframe contains the Datetime range that my final df should contain
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:08:00', '2014-10-24 10:55:00'], "Stop": ['2014-11-21 11:07:00', '2014-11-21 11:11:00', '2014-10-24 10:59:00']})
Start Stop
0 2014-11-21 11:00:00 2014-11-21 11:00:07
1 2014-11-21 11:00:08 2014-11-21 11:00:11
2 2014-10-24 10:00:55 2014-10-24 10:00:59
The goal is to calculate the cumulative sum in df
with a two seconds window
, if and only if the actual row in df
is in range of one of the rows in info
. For example cumulative sum for row with date 2014-11-21 11:00:08
should be 0
. Because it's in the start of a range, another example is row with date 2014-11-21 11:00:07
, its cumsum should be 12(9+3)
.
Here's what i've achieved until now :
import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)
Expected Result :
Date A groupnum cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 2
2 2014-11-21 11:00:04 5 0 7
3 2014-11-21 11:00:05 3 0 10
4 2014-11-21 11:00:07 9 0 12
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 3
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 10
Actual Result :
Date A groupnum cum
0 2014-11-21 11:00:00 1 0 1
1 2014-11-21 11:00:03 2 0 3
2 2014-11-21 11:00:04 5 0 8
3 2014-11-21 11:00:05 3 0 11
4 2014-11-21 11:00:07 9 0 20
5 2014-11-21 11:00:08 6 1 6
6 2014-11-21 11:00:10 3 1 9
7 2014-11-21 11:00:11 0 1 9
8 2014-10-24 10:00:55 8 2 8
9 2014-10-24 10:00:59 10 2 18
But this is doing cumulative sum over the groupnum and i can't manage to cumulate only by 2s.
So is ther anyproper way to achieve this ? I'd be really grateful.
My english isn't that good, i hope that i explained you right