The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series, which is stored in a pandas DataFrame.
This time series is sorted by its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (ok) code.
import random
import pandas as pd

# input dataset: 10 hourly rows with reproducible random integer values
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val': val}, index=ts)
# groupby custom datetimeindex: label the first row of each block, then forward-fill
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()
# cumsum restarted within each block
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
                     val
2021-01-01 00:00:00    5
2021-01-01 01:00:00    3
2021-01-01 02:00:00    9
2021-01-01 03:00:00    4
2021-01-01 04:00:00    8
2021-01-01 05:00:00   13
2021-01-01 06:00:00   15
2021-01-01 07:00:00   14
2021-01-01 08:00:00   11
2021-01-01 09:00:00    7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
                     val   id  cumsum
2021-01-01 00:00:00    5  NaN      -1
2021-01-01 01:00:00    3  0.0       3
2021-01-01 02:00:00    9  0.0      12
2021-01-01 03:00:00    4  1.0       4
2021-01-01 04:00:00    8  1.0      12
2021-01-01 05:00:00   13  1.0      25
2021-01-01 06:00:00   15  1.0      40
2021-01-01 07:00:00   14  2.0      14
2021-01-01 08:00:00   11  2.0      25
2021-01-01 09:00:00    7  2.0      32
The question
Is groupby the most efficient approach in terms of CPU and memory in this case, where blocks are made of contiguous rows?
I would think that with groupby, a first pass over the full dataset is made to identify all the rows to group together.
Knowing rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group. As soon as I hit the first row of the next group, I know calculations are done for the previous group.
In case rows are contiguous, the sorting step is lighter.
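To make the idea concrete, here is a minimal sketch of the kind of contiguity-aware computation I have in mind. It does not use groupby. The helper name `blockwise_cumsum` is hypothetical, and the sketch assumes the index is sorted, that every block start is present in the index, and that there is at least one block start. I have not benchmarked it against groupby.

```python
import numpy as np
import pandas as pd


def blockwise_cumsum(values, index, block_starts):
    """Cumulative sum restarted at each block start, without groupby.

    Assumes `index` is sorted, every entry of `block_starts` is in it,
    and `block_starts` is non-empty. Rows before the first block start
    are returned as NaN.
    """
    values = np.asarray(values, dtype="float64")
    # Position of the first row of each block in the sorted index.
    starts = index.searchsorted(pd.DatetimeIndex(block_starts))
    # One cumulative sum over the whole column.
    total = np.cumsum(values)
    # Running total accumulated just before each block begins.
    offsets = total[starts] - values[starts]
    # Block number of each row (-1 for rows before the first block).
    block_id = np.searchsorted(starts, np.arange(len(values)), side="right") - 1
    out = total - offsets[block_id]
    out[block_id < 0] = np.nan
    return out


# Same values and block starts as in the example above.
ts = pd.date_range(start="2021/01/01 00:00", periods=10, freq="1h")
val = [5, 3, 9, 4, 8, 13, 15, 14, 11, 7]
key_ts = [ts[i] for i in [1, 3, 7]]
result = blockwise_cumsum(val, ts, key_ts)
```

Note that subtracting an offset from a global cumulative sum is exact for integers of this size, but may lose some precision with floating-point data compared to summing each block separately.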
Hence the question: is there a way to tell pandas that the rows of each group are contiguous, to save some CPU?
Thanks in advance for your feedback. Best,