
I want to detect the local maxima and minima of a time series, always looking only to the left. Looking right would mean looking into the future, since the series is analyzed live. My method:

  • While the signal is increasing, you update the variable max
  • While it is decreasing, you update the variable min
  • Then, while decreasing, if the value falls more than 50% of (max - min) below max, you consider that you are defining a new low
  • And vice versa while increasing

It translates like this:

import pandas as pd

timerange = pd.date_range(start='1/1/2018', end='1/31/2018')
data = [0, 1, 2, 3, 4, 2, 1, 0, -1, 0, 3, 2, 1, 1, 0.5, 0, 1, 2, 4, 5, 6, 7, 8, 4, -2, -4, 0, 5, 3, 2, 0]
timeseries = pd.DataFrame(index=timerange, data=data, columns=['Value'])

max = data[0]
min = data[0]
pct = .5
tendancy = False
for now in timeseries.index:

    value = timeseries.loc[now, 'Value']

    if value >= max:
        max = value
    if value <= min:
        min = value

    range = max-min

    # Cancel the previous max value when going up if the 50% rule is triggered
    if value >= min + range * pct and tendancy != 'up':
        tendancy = 'up'
        max = value
    # Cancel the previous min value when going down if the 50% rule is triggered
    elif value <= max - range * pct and tendancy != 'down':
        tendancy = 'down'
        min = value

    ratio = (value-min)/(max-min)

    timeseries.loc[now, 'Max'] = max
    timeseries.loc[now, 'Min'] = min
    timeseries.loc[now, 'Ratio'] = ratio

timeseries[['Value', 'Min', 'Max']].plot()
timeseries['Ratio'].plot(secondary_y=True)

It works as expected: as a result, looking at the Ratio column, you know whether you are currently defining a new low (0) or a new high (1), whatever the amplitude or the frequency of the signal.

However, on my real data (~200,000 rows), it is very slow. I was wondering if there is a way to optimize this, especially using the .apply() method of DataFrame. But since each result depends on the previous row, I don't know whether that method is applicable.

Maxime

1 Answer


The first and easy speed-up is, instead of iterating over the index and accessing each row with loc, to iterate over the values directly and append the three results (max_, min_, ratio) to a list:

max_ = data[0] # NOTE: variables renamed with a trailing _ to avoid shadowing built-in names
min_ = data[0]
pct = .5
tendancy = False
l_res = [] # list for the results
for value in timeseries['Value'].to_numpy(): #iterate over the values

    if value >= max_:
        max_ = value
    if value <= min_:
        min_ = value

    range_ = max_-min_

    # Cancel the previous max value when going up if the 50% rule is triggered
    if value >= min_ + range_ * pct and tendancy != 'up':
        tendancy = 'up'
        max_ = value
    # Cancel the previous min value when going down if the 50% rule is triggered
    elif value <= max_ - range_ * pct and tendancy != 'down':
        tendancy = 'down'
        min_ = value

    ratio = (value-min_)/(max_-min_)
    # append the three results in the list
    l_res.append([max_, min_, ratio])

# create the three columns outside of the loop
timeseries[['Max', 'Min','Ratio']] = pd.DataFrame(l_res, index=timeseries.index)

In terms of timing, I put both ways in functions (f_maxime for yours and f_ben for this one), and it gives:

%timeit f_maxime(timeseries)
# 16.4 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit f_ben(timeseries)
# 651 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

so this way is about 25x faster, and for 200K rows it should still be roughly 25x faster. I also checked that the result is the same:

(f_ben(timeseries).fillna(0) == f_maxime(timeseries).fillna(0)).all().all()
#True

Regarding the use of apply, I don't think it has any value for speeding up the code in this case; see this
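If another level of speed-up is needed, the loop itself can be compiled with Numba, as suggested in the comments below. Here is a minimal sketch (the function name `minmax_ratio` and the integer encoding of the tendency are my own choices; it expects a NumPy float array and falls back to plain Python when Numba is not installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:          # fall back to plain Python if Numba is absent
    njit = lambda f: f

@njit
def minmax_ratio(values, pct=0.5):
    """Same 50%-rule loop as above, over a NumPy array, returning
    an (n, 3) array with columns max_, min_, ratio."""
    n = len(values)
    out = np.empty((n, 3))
    max_ = values[0]
    min_ = values[0]
    tendancy = 0                     # 0 = unset, 1 = up, -1 = down
    for i in range(n):
        value = values[i]
        if value >= max_:
            max_ = value
        if value <= min_:
            min_ = value
        range_ = max_ - min_
        # Cancel the previous max value when going up if the 50% rule is triggered
        if value >= min_ + range_ * pct and tendancy != 1:
            tendancy = 1
            max_ = value
        # Cancel the previous min value when going down if the 50% rule is triggered
        elif value <= max_ - range_ * pct and tendancy != -1:
            tendancy = -1
            min_ = value
        # Guard against 0/0 on the first rows, where max_ == min_
        ratio = (value - min_) / (max_ - min_) if max_ > min_ else np.nan
        out[i, 0] = max_
        out[i, 1] = min_
        out[i, 2] = ratio
    return out

# Usage sketch:
# res = minmax_ratio(timeseries['Value'].to_numpy(dtype=np.float64))
# timeseries[['Max', 'Min', 'Ratio']] = res
```

On large arrays a compiled loop like this can be much faster again than the pure-Python loop, at the cost of the extra dependency (and a one-off compilation on the first call).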

Ben.T
  • Works great, and thanks for the documentation on the `apply` method, I learnt a lot – Maxime May 12 '20 at 20:55
  • @Maxime I have tried to vectorize it, but the `if...elif...` part seems difficult to do that way. If you really need another level of speed-up, I suggest you have a look at Numba, but it is another library :) – Ben.T May 12 '20 at 21:04
  • I've heard about this library several times, I'll have a look. Thanks! – Maxime May 13 '20 at 16:02