
The context

I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series, which is stored in a pandas DataFrame. The time series is sorted according to its DatetimeIndex.

Blocks are defined by a custom DatetimeIndex.

To do so, I came up with the following (working) code.

import random
import pandas as pd

# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val' : val}, index=ts)

# groupby custom datetimeindex
key_ts = [ts[i] for i in [1,3,7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()

# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]: 
                     val
2021-01-01 00:00:00    5
2021-01-01 01:00:00    3
2021-01-01 02:00:00    9
2021-01-01 03:00:00    4
2021-01-01 04:00:00    8
2021-01-01 05:00:00   13
2021-01-01 06:00:00   15
2021-01-01 07:00:00   14
2021-01-01 08:00:00   11
2021-01-01 09:00:00    7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]: 
[Timestamp('2021-01-01 01:00:00', freq='H'),
 Timestamp('2021-01-01 03:00:00', freq='H'),
 Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]: 
                     val   id  cumsum
2021-01-01 00:00:00    5  NaN     NaN
2021-01-01 01:00:00    3  0.0     3.0
2021-01-01 02:00:00    9  0.0    12.0
2021-01-01 03:00:00    4  1.0     4.0
2021-01-01 04:00:00    8  1.0    12.0
2021-01-01 05:00:00   13  1.0    25.0
2021-01-01 06:00:00   15  1.0    40.0
2021-01-01 07:00:00   14  2.0    14.0
2021-01-01 08:00:00   11  2.0    25.0
2021-01-01 09:00:00    7  2.0    32.0

The question

Is groupby the most efficient approach in terms of CPU and memory in this case, where blocks are made of contiguous rows? I would think that with groupby, a first pass over the full dataset is made to identify all the rows to group together.

Knowing the rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group: as soon as I hit the first row of the next group, I know the calculations for the previous group are done.

When the rows are contiguous, the sorting step is also much lighter.

Hence the question: is there a way to tell pandas about this, to save some CPU?
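
For reference, below is the kind of timing harness I would use to compare answers on a larger frame; the size, frequency and block length are purely illustrative.

import timeit

import numpy as np
import pandas as pd

# A larger version of the toy dataset above, with 1000 contiguous rows per block.
n = 1_000_000
big_ts = pd.date_range('2021-01-01', periods=n, freq='1min')
big_df = pd.DataFrame({'val': np.random.default_rng(1).integers(1, 100, size=n)},
                      index=big_ts)
big_df['id'] = np.arange(n) // 1000

t = timeit.timeit(lambda: big_df.groupby('id')['val'].cumsum(), number=10)
print(f'groupby cumsum: {t / 10:.4f} s per run')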

Thanks in advance for your feedback.

pierre_j

1 Answer


groupby is clearly not the fastest solution here because it has to use either a slow sort or slow hashing operations to group the values.

What you want to implement is called a segmented cumulative sum. You can implement it quite efficiently using NumPy, but this is a bit tricky (especially because of the NaN values) and not the fastest solution, because it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do the whole thing in one pass.
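
For reference, a possible NumPy-only sketch is shown below (the function name is mine; it assumes the id column is sorted so that blocks are contiguous, and it simply treats a NaN row as its own block instead of flagging it with -1):

import numpy as np

def segmented_cumsum_np(ids, values):
    # Global cumulative sum, then subtract the total reached just before
    # the first row of each block.
    total = np.cumsum(values)
    # First row of each block (NaN compares unequal to anything, so a NaN
    # row also starts a new block).
    starts = np.flatnonzero(np.r_[True, ids[1:] != ids[:-1]])
    lengths = np.diff(np.r_[starts, len(values)])
    # Total accumulated before each block begins (0 for the first block).
    offsets = np.r_[0, total[starts[1:] - 1]]
    return total - np.repeat(offsets, lengths)

# segmented_cumsum_np(df['id'].to_numpy(), df['val'].to_numpy())

This already needs several full passes over the arrays (cumsum, comparison, repeat), which is why a single-pass Numba version should still be faster.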

Here is a Numba implementation:

import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use: 
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            # Restart the running sum at the first row of each new block
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            # Rows without an id are flagged with -1 and reset the running sum
            res[i] = -1
            curValue = zero
    return res

df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())

Note that ids[i-1] != ids[i] might fail with big floats because of their limited precision. The best solution is to use integers, with -1 replacing the NaN values. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
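
For example, a sketch of that integer variant could look like this (the helper name and the -1 encoding of missing ids are just an illustration):

import numpy as np
import numba as nb

@nb.njit
def segmentedCumSumInt(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    curValue = 0
    for i in range(size):
        if ids[i] != -1:
            # Exact integer comparison: no precision issue anymore
            if i > 0 and ids[i-1] != ids[i]:
                curValue = 0
            curValue += values[i]
            res[i] = curValue
        else:
            res[i] = -1
            curValue = 0
    return res

# Encode the missing ids as -1 before calling the function
ids_int = df['id'].fillna(-1).astype(np.int64).to_numpy()
df['cumsum'] = segmentedCumSumInt(ids_int, df['val'].to_numpy())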

Jérôme Richard
    Thanks a lot @Jérôme! Ok. Hmmm, this segmented cumulative sum seems to me a 'basic' function. I am surprised nothing 'efficient' exists in pandas. I will propose a ticket if you don't mind. Thanks a lot for your help. Ok, let's do it with Numba then! :) Thanks for the typing help in the decorator. I have 'always' difficulties to define them for Numba. Will try the 'guvectorize' one. Might get additional perf? :) – pierre_j Apr 25 '21 at 14:23
  • @pierre_j AFAIK, you can probably write a code to do that in Pandas like you can in Numpy but both will likely be complex or less efficient than this. I am not aware of any built-in function in Pandas to do exactly that so far although it would be convenient. I have never tried using `guvectorize` but you may get a slight improvement indeed. The code should be already very fast (it is able to compute several GB/s of data on my desktop computer) ;) . – Jérôme Richard Apr 25 '21 at 14:42
  • thanks! Coming back on one of your comment regarding 'id' being float instead of int. In my proposed dataset, what I actually really want to compare are timestamps, and I am turning that into 'ids' indeed. I will try to directly use timestamp maybe? But the last time I have tried to use timestamps in Numba has been a failure... :) – pierre_j Apr 25 '21 at 16:21
  • Yeah this is a bit more tricky to do such thing with Numba. If you do not need date-time functions, I think this is simpler to convert the column to np.int64 values (see [here](https://stackoverflow.com/questions/47562634)). I think the type should be large enough to hold all the possible (interesting) date-times. – Jérôme Richard Apr 25 '21 at 16:41
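
As a follow-up to the last comment, here is a minimal sketch of that conversion (my own illustration, not code from the linked question): a DatetimeIndex can be viewed as plain int64 nanosecond timestamps, which Numba handles natively.

import numpy as np

# nanoseconds since the Unix epoch, as a plain int64 array usable in Numba
ts_int = df.index.to_numpy().view(np.int64)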