
I have a pandas DataFrame with columns 'time' and 'current'. It also has lots of other columns that I don't want to use for this operation. All values are floats.

df[['time','current']].head()

     time  current
1     0.0      9.6
2   300.0      9.3
3   600.0      9.6
4   900.0      9.5
5  1200.0      9.5

I'd like to calculate the rolling (expanding) integral of current over time, so that at each point in time I get the integral of the current over time from the start up to that point. (I realize this particular operation is simple, but it's just an example; I'm not really looking for this specific function, but for the method as a whole.)
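To make the target concrete, here is the result I'm after written as an explicit loop over expanding windows (this is only to illustrate the output I want, not how I want to compute it; it assumes the df shown above):

import scipy.integrate

# integral of current over time from the first row up to and including row i
expanding_integral = [
    scipy.integrate.trapezoid(df['current'].iloc[:i + 1], x=df['time'].iloc[:i + 1])
    for i in range(len(df))
]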

Ideally, I'd be able to do something like this:

df[['time','current']].expanding().apply(scipy.integrate.trapezoid)

or

df[['time','current']].expanding(method = 'table').apply(scipy.integrate.trapezoid)

but neither of these works, because I need to pass both columns to the function: 'current' as the values to integrate and 'time' as the sample points. The function does work with one column (current alone), but then I have to account for the time step separately afterwards, which I'd rather avoid.

It seems DataFrame columns can't be accessed within expanding().apply(). I've heard that internally the expanding window is treated as an array, so I've also tried this:

df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x[0], x[1]))


df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x['time'], x['current']))

and variations, but I can never access the columns in expanding().

As a matter of fact, even apply() on a plain DataFrame doesn't give access to both columns at once, since each column is passed to the function in turn as a Series.

df[['time','current']].apply(lambda x:scipy.integrate.trapezoid(x.time,x.current))

...

AttributeError: 'Series' object has no attribute 'time'

This answer mentions method='table' for expanding(), but that option wasn't released at the time the answer was written, and I can't figure out what it needs in order to work here. Their solution was simply to do the calculation manually.

I've also tried defining the function first, but this returns an error too:

def func(x,y):
    return(scipy.integrate.trapezoid(x,y))

df[['time','current']].expanding().apply(func)

...

DataError: No numeric types to aggregate

Is what I'm asking even possible with expanding().apply()? Should I just do it another way? Can I apply expanding inside the apply()?

Thanks, and good luck.

Connor
  • `scipy.integrate.cumtrapz` is already a cumulative (expanding) calculation, so just use that? – ALollz May 24 '22 at 16:04
  • @ALollz I wasn't aware of that, I'll look into it. But that doesn't really get around the general problem. Thanks though. – Connor May 25 '22 at 07:02

1 Answer


Overview

This isn't fully supported in pandas yet, but there are workarounds. expanding() and rolling(), combined with .agg() or .apply(), work column by column unless you specify method='table' (see Method 2).
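A quick way to see that column-by-column behaviour (a small sketch, assuming the df from the question):

# each call receives a one-dimensional window from a single column,
# so 'time' and 'current' are never available together
df[['time', 'current']].expanding().apply(lambda w: w.ndim, raw=True)
# -> both result columns are filled with 1.0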

Method 1

There is a workaround to get what you want, as long as you only need a single output column. The trick is to move a column into the index and then reset the index inside the function. (Don't actually do this with scipy.integrate.trapezoid, though: as @ALollz said, scipy.integrate.cumtrapz is already a cumulative (expanding) calculation.) A sketch applying this to the question's columns follows the code below.

import scipy.integrate

def custom_func(series):
    sub_df = series.reset_index()
    # work with the sub-DataFrame as you would in a groupby:
    # you have access to sub_df.x and sub_df.y here
    return scipy.integrate.trapezoid(sub_df.x, sub_df.y)

df.set_index('y').expanding().agg(custom_func)
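Applied to the question's columns, the same idea could look like the sketch below (hypothetical, using the 'time' and 'current' names from the question; the window's index is read directly instead of being reset inside the function, which amounts to the same trick). The last line cross-checks against the cumulative routine from the comments (cumulative_trapezoid is the newer name for cumtrapz).

import pandas as pd
from scipy.integrate import trapezoid, cumulative_trapezoid

# data matching the head() shown in the question
df = pd.DataFrame({'time': [0.0, 300.0, 600.0, 900.0, 1200.0],
                   'current': [9.6, 9.3, 9.6, 9.5, 9.5]})

def expanding_integral(window):
    # 'time' was moved into the index, so it travels along with each window
    return trapezoid(window.to_numpy(), x=window.index.to_numpy())

result = (df[['time', 'current']]
          .set_index('time')
          .expanding()
          .apply(expanding_integral))   # one column in ('current'), one column out

# cross-check with the cumulative routine suggested in the comments
check = cumulative_trapezoid(df['current'], x=df['time'], initial=0)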

Method 2

You can make use of method='table' (available since pandas 1.3.0) in expanding() and rolling(). In that case you need to call .apply(custom_func, raw=True, engine='numba') and write custom_func as numba-compilable Python (beware of types); it receives the numpy array representation of your DataFrame window. The function has to return one value per input column, so if you need more outputs than you have columns you may have to add dummy columns to the input and rename the columns afterwards. The general skeleton is below; a concrete sketch for the question's two columns follows it.

import numpy as np

min_periods = 100

def custom_func(table):
    # table is the whole window as a 2-D numpy array (rows x columns);
    # the result must contain one value per input column
    rep = np.zeros(table.shape[1])
    # you need something like this if you want to use the min_periods argument
    if len(table) < min_periods:
        return rep
    # do something with your numpy arrays
    return rep

df.expanding(min_periods, method='table').apply(custom_func, raw=True, engine='numba')

# Rename
df.columns = ...
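As a concrete (hypothetical) sketch of this method for the question's two columns, with 'time' in column 0 and 'current' in column 1 of the array handed to the function, written with a plain loop so numba can compile it (assumes pandas >= 1.3.0 and numba installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0.0, 300.0, 600.0, 900.0, 1200.0],
                   'current': [9.6, 9.3, 9.6, 9.5, 9.5]})

def expanding_trapz(table):
    # table is the whole window: column 0 = time, column 1 = current
    area = 0.0
    for i in range(1, table.shape[0]):
        area += 0.5 * (table[i, 1] + table[i - 1, 1]) * (table[i, 0] - table[i - 1, 0])
    # method='table' expects one value per input column, so repeat the result
    return np.full(table.shape[1], area)

out = (df[['time', 'current']]
       .expanding(min_periods=1, method='table')
       .apply(expanding_trapz, raw=True, engine='numba'))

# both output columns hold the same integral; keep one and give it a clearer name
integral = out['current'].rename('integral')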
Kkameleon