
I'm kind of new to Python and data science.

I have these two DataFrames. The df DataFrame:

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

                  Date   A
0  2014-11-21 11:00:00   1
1  2014-11-21 11:00:03   2
2  2014-11-21 11:00:04   5
3  2014-11-21 11:00:05   3
4  2014-11-21 11:00:07   9
5  2014-11-21 11:00:08   6
6  2014-11-21 11:00:10   3
7  2014-11-21 11:00:11   0
8  2014-10-24 10:00:55   8
9  2014-10-24 10:00:59  10

The info DataFrame contains the datetime ranges that my final df should be limited to:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

                 Start                 Stop
0  2014-11-21 11:00:00  2014-11-21 11:00:07
1  2014-11-21 11:00:08  2014-11-21 11:00:11
2  2014-10-24 10:00:55  2014-10-24 10:00:59

The goal is to calculate, for each row of df, the sum of A over a two-second window, if and only if that row falls in one of the ranges in info. For example, the sum for the row at 2014-11-21 11:00:08 starts over (nothing is carried from earlier rows) because it is at the start of a range; another example is the row at 2014-11-21 11:00:07, whose sum should be 12 (9 + 3), since both 11:00:05 and 11:00:07 fall within the two-second window.

Here's what I've achieved so far:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum'] = info.index
info.Start = pd.to_datetime(info.Start)
info.Stop = pd.to_datetime(info.Stop)
# map each date to the interval (range) it falls in, to get its group number
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum'] = pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)

Expected result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    2
2  2014-11-21 11:00:04   5         0    7
3  2014-11-21 11:00:05   3         0   10
4  2014-11-21 11:00:07   9         0   12
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    3
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   10

Actual result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    3
2  2014-11-21 11:00:04   5         0    8
3  2014-11-21 11:00:05   3         0   11
4  2014-11-21 11:00:07   9         0   20
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    9
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   18

But this computes the cumulative sum over the whole groupnum, and I can't manage to restrict it to a 2-second window.

So is there any proper way to achieve this? I'd be really grateful.

My English isn't that good; I hope I explained it well.

Fragan

2 Answers


This method may not work for a 100M-row dataframe.

To create the groupnum column, you can use ufunc.outer with greater_equal and less_equal to compare each date in df with each Start and Stop in info, and get the position of the first True row-wise with argmax. Then you can groupby on this column and use rolling over 2s with closed='both':

# make sure the dates are real datetimes (needed for the comparisons and the 2s rolling window)
df['Date'] = pd.to_datetime(df['Date'])
info['Start'] = pd.to_datetime(info['Start'])
info['Stop'] = pd.to_datetime(info['Stop'])

# create a boolean array to find which range each row falls in
arr_bool = (np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
            & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))

# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)

# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')

# groupby groupnum, sum over a closed interval of 2s
df['cum'] = df.groupby('groupnum').rolling('2s', closed='both').A.sum()\
              .reset_index(level=0, drop=True)  # for index alignment

df = df.reset_index()  # get back Date as a column
print(df)
                 Date   A  groupnum   cum
0 2014-11-21 11:00:00   1         0   1.0
1 2014-11-21 11:00:03   2         0   2.0
2 2014-11-21 11:00:04   5         0   7.0
3 2014-11-21 11:00:05   3         0  10.0
4 2014-11-21 11:00:07   9         0  12.0
5 2014-11-21 11:00:08   6         1   6.0
6 2014-11-21 11:00:10   3         1   9.0
7 2014-11-21 11:00:11   0         1   3.0
8 2014-10-24 10:00:55   8         2   8.0
9 2014-10-24 10:00:59  10         2  10.0

Edit: if arr_bool cannot be created this way (for example because it does not fit in memory), you can instead iterate over info's rows and check independently whether each date is at or after Start and at or before Stop:

# get once an array of all dates (should be faster)
arr_date = df.Date.to_numpy()

# create groups by sum 
df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop)) 
                         for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1

# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')

# then same for the column cum
df['cum'] = ...
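
For completeness, the IntervalIndex mapping from the question can also be used to build groupnum without materializing the full boolean array, and then combined with the same 2s rolling sum. This is only a minimal sketch, assuming Start and Stop have already been converted to datetimes as in the question:

info['groupnum'] = info.index
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']

df['Date'] = pd.to_datetime(df['Date'])
df['groupnum'] = df['Date'].map(cinfo)  # NaN for dates outside every range

# keep only the rows that fall inside a range, then apply the same rolling-2s step
df = df.dropna(subset=['groupnum']).set_index('Date')
df['cum'] = df.groupby('groupnum').rolling('2s', closed='both').A.sum()\
              .reset_index(level=0, drop=True)
df = df.reset_index()

Whether this is faster or lighter on memory than the ufunc.outer approach is not measured here; as discussed in the comments, it is worth trying both on a large dataframe.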
Ben.T
  • Thanks bud, why do you think it shouldn't work with 100M rows? Is it because of `df = df.loc[arr_bool.any(axis=1), :].set_index('Date')`? – Fragan Jan 14 '20 at 15:38
  • @Arès no, it is the creation of `arr_bool` that will be HUGE: it has the same number of rows as `df` and as many columns as `info` has rows (that is why I asked about this earlier). Even though it is a boolean array, it may be too big for your memory – Ben.T Jan 14 '20 at 15:41
  • What do you think of my solution for building the group number? `cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']` then `df['groupnum']=pd.to_datetime(df.Date).map(cinfo)`, do you think it's any faster? – Fragan Jan 14 '20 at 15:42
  • 1
    @Arès I have never used `pd.IntervalIndex.from_arrays` so I'm not sure how efficient (time and space) this is. Maybe on a 1M-row dataframe you can try both and see! I will add an alternative if the creation of arr_bool is not possible. but it will be with an iteration over `info` – Ben.T Jan 14 '20 at 15:49
  • 1
    @Arès see my edit, and one observation about loop:, indeed `iterrows` is not the best option but if vectorization is not possible, then list comprehension like in the edit is the best next thing (see this [answer](https://stackoverflow.com/a/55557758/9274732)) ;) – Ben.T Jan 14 '20 at 16:09
  • 1
    Thank for all those great info Ben, i really appreciate it. That helped me to understand better the vectorization and the cons of iterations in pandas. – Fragan Jan 14 '20 at 16:35

I tried the following:

import pandas
from datetime import datetime

df = pandas.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pandas.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

df['Date'] = df['Date'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
info['Start'] = info['Start'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
info['Stop'] = info['Stop'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

Now we have the dates properly converted to datetime.

for row in info.iterrows():
    mask = (df['Date']>=row[1]['Start'])&(df['Date']<=row[1]['Stop'])
    df.loc[mask, 'cumsum'] = df[mask]['A'].cumsum()

This will add a new column named cumsum to your dataframe. The result should match your request:

                Date    A   cumsum
0   2014-11-21 11:00:00 1   1.0
1   2014-11-21 11:00:03 2   3.0
2   2014-11-21 11:00:04 5   8.0
3   2014-11-21 11:00:05 3   11.0
4   2014-11-21 11:00:07 9   20.0
5   2014-11-21 11:00:08 6   6.0
6   2014-11-21 11:00:10 3   9.0
7   2014-11-21 11:00:11 0   9.0
8   2014-10-24 10:00:55 8   8.0
9   2014-10-24 10:00:59 10  18.0

UPDATE 1:

Sorry, I missed one piece: in order to resample you can do:

df.index = df['Date']
df.drop(labels=['Date'], axis=1, inplace=True)
for row in info.iterrows():
    mask = (df.index>=row[1]['Start'])&(df.index<=row[1]['Stop'])
    df.loc[mask, 'cumsum'] = df[mask]['A'].resample('2S').sum()

But this also will not produce correct results if there are 2-second intervals with no values in them. To handle this, you may want to linearly interpolate before resampling ;)
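
As a hypothetical illustration of that last suggestion (not part of the original answer), one way to read it is: for one interval's slice, upsample to a 1-second grid, linearly interpolate the gaps, and only then resample to 2 seconds:

# hypothetical sketch of "interpolate before resample" for a single interval's slice
s = df[mask]['A']                   # datetime-indexed values for one interval
s = s.resample('1S').mean()         # upsample to 1s; empty seconds become NaN
s = s.interpolate(method='linear')  # fill the gaps linearly
binned = s.resample('2S').sum()     # every 2s bin now has values to sum

Whether interpolated values should really be summed depends on what the data represents, so treat this only as a sketch of the idea.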

UPDATE 2:

Now, the problem is that there is a mismatch between the timestamps in the original dataframe and the ones after resampling. To understand what is happening, take a look at:

df.index = df['Date']
df.drop(labels=['Date'], axis=1, inplace=True)
res = []
for row in info.iterrows():
    mask = (df.index>=row[1]['Start'])&(df.index<=row[1]['Stop'])
    res.append(df[mask]['A'].resample('2S').sum())

res will contain 3 series, one for each interval in info:

2014-11-21 11:00:00    1
2014-11-21 11:00:02    2
2014-11-21 11:00:04    8
2014-11-21 11:00:06    9

2014-11-21 11:00:08    6
2014-11-21 11:00:10    3 

2014-10-24 10:00:54     8
2014-10-24 10:00:56     0
2014-10-24 10:00:58    10

As you can see, your data has been properly resampled every 2 seconds, but the indices no longer match; this causes the NaN values you see in the cumsum column in Update 1.

Now, I think the correct result to aim for is the last one, where the data is properly and evenly resampled and summed every 2 seconds. In any case, if this is not the result you want, it should be easy to modify my solution in the direction you prefer ;)

Pierluigi
  • Isn't there any other, non-iterative way to do it? Iteratively it's quite simple, but I want to use pandas methods as much as possible to speed up the process (in the real case I'll have around 100M rows, so it would be really slow to do this with iterrows) – Fragan Jan 14 '20 at 14:44
  • 1
    @Arès what is the size of `info` because the iteration is on this one, not on `df`? – Ben.T Jan 14 '20 at 14:47
  • @Ben.T Oh yeah, right, I thought the iterrows was over `df`, my bad. But do you think `loc` is fine with a large amount of data too? I don't know how it actually works – Fragan Jan 14 '20 at 14:49
  • 1
    @Arès doing a boolean indexing with `loc` is vectorized. But anyway the result with this method is your current result that is different of your expected result – Ben.T Jan 14 '20 at 14:53