
I have the following data:

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
        -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]

Every time the data point changes by more than the step size, I want to record it. If it does not, I want to keep the old value until the cumulative change is at least as large as the step size. I achieve this iteratively like so:

import pandas as pd
from copy import deepcopy
import numpy as np

step = 0.5
df_steps = pd.Series(data)
df = df_steps.copy()

today = None
yesterday = None
for index, value in df_steps.items():  # .iteritems() was removed in pandas 2.0; .items() is equivalent
    today = deepcopy(index)
    if today is not None and yesterday is not None:
        if abs(df.loc[today] - df_steps.loc[yesterday]) > step:
            df_steps.loc[today] = df.loc[today]
        else:
            df_steps.loc[today] = df_steps.loc[yesterday]

    yesterday = deepcopy(today)

My final result is:

[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, -0.5, -0.5, -0.5, -0.5, -1.2, -0.1, -0.7]

Problem and Question

The problem is that this is achieved iteratively (I agree with the second answer here). My question is: how can I achieve the same result in a vectorized fashion?

Attempts

My attempt is the following, but it does not match the result:

(df.diff().cumsum().replace(np.nan, 0) / step).astype(int)
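
This does not match because `df.diff().cumsum()` only measures the drift from the first element, so the reference never resets once a change is recorded. A quick check of that equivalence (using the `df` and `step` defined above):

lhs = df.diff().cumsum().replace(np.nan, 0)
rhs = df - df.iloc[0]          # drift from the first value only
print(np.allclose(lhs, rhs))   # True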

2 Answers


Since a purely vectorised approach does not seem trivial, we can go with numba to compile the code down to C level, and hence have a loopy but very performant approach. Here's one way using numba's nopython mode:

from numba import njit, float64
import numpy as np

@njit('float64[:](float64[:], float32)')
def set_at_cum_change(a, step):
    out = np.empty(len(a), dtype=float64)
    # The first value is always recorded.
    prev = a[0]
    out[0] = a[0]
    for i in range(1, len(a)):
        current = a[i]
        if np.abs(current - prev) > step:
            # Change exceeds the step size: record the new value.
            out[i] = current
            prev = current
        else:
            # Otherwise carry the last recorded value forward.
            out[i] = out[i-1]
    return out

Testing on the same array gives:

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1,
                 -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7])

out = set_at_cum_change(data, step=0.5)

out
# array([ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.7,  0.7,  0.7,  0.7,  0.1,
#         0.1,  0.1,  0.1,  0.1, -0.5, -0.5, -0.5, -0.5, -1.2, -0.1, -0.7])

If we check the timings, we see a huge ~110,000x speedup with the numba approach on a 22,000-element array. This not only shows that numba is a great approach in these cases, it also makes clear that using pandas' iterrows/iteritems is almost always a bad idea:

def op(data):
    step = 0.5
    df_steps = pd.Series(data)
    df = df_steps.copy()

    today = None
    yesterday = None
    for index, value in df_steps.items():
        today = deepcopy(index)
        if today is not None and yesterday is not None:
            if abs(df.loc[today] - df_steps.loc[yesterday]) > step:
                df_steps.loc[today] = df.loc[today]
            else:
                df_steps.loc[today] = df_steps.loc[yesterday]

        yesterday = deepcopy(today)
    return df_steps.to_numpy()

def fn(step):
    current = float('inf')
    i = yield

    while True:
        if abs(current - i) > step:
            current = i
            i = yield i
        else:
            i = yield current

def andrej(data):
    df = pd.DataFrame({'data': data})
    f = fn(0.5)
    next(f)
    df['new_data'] = df['data'].apply(lambda x: f.send(x))

data_large = np.tile(data, 1_000)
print(data_large.shape)
# (22000,)

np.allclose(op(data_large), set_at_cum_change(data_large, step=0.5))
# True

%timeit op(data_large)
# 5.78 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit andrej(data_large)
# 13.6 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit set_at_cum_change(data_large, step=0.5)
# 50.4 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • I must look at `numba` finally, heard good things about it. +1 for the benchmark :) – Andrej Kesely Jun 03 '20 at 21:09
  • @yatu, if I were to apply this function to a numpy matrix, would it be efficient just to make another loop over the columns or is there a better way? – Newskooler Jun 05 '20 at 15:33
  • You could easily adapt. Just make sure to set the shape of `out` accordingly; the indexing is then in 2D. Let me know if you have problems @news – yatu Jun 05 '20 at 15:38
  • Thanks. Do you think stuff like `numpy.apply_along_axis()` would be quicker as opposed to a for col in columns / for row in rows? – Newskooler Jun 05 '20 at 15:48
  • No, not at all. `apply_along_axis` is just a convenience function that actually loops over the array; there is no vectorization. I used numba here since this seemed pretty tough to vectorize. Your best option here is to adapt the above to work with 2D arrays (see the sketch after these comments). If you struggle with that, let me know and I'll help @news – yatu Jun 05 '20 at 15:57
  • I am doing it now. Will let you know. Thanks. One Q I have is: do you use `range(1, len(a))` instead of `range(len(a))` because of the later `out[i-1]`, which would be wrong when `i` is zero? – Newskooler Jun 05 '20 at 16:09
  • 1
    That's just because I initialize with the first value in the arrays @news :) so just start looping from the second. – yatu Jun 05 '20 at 16:17
  • Another question that came up is why did you choose to use `np.zeros` instead of `np.empty` and then populate with zeros? I notice you provide a numba dtype and thought maybe it has to do with that? – Newskooler Jun 15 '20 at 01:01
  • 1
    No, in fact `np.empty` would be a better choice, since it has less of a memory footprint @news – yatu Jun 15 '20 at 07:31
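
Following up on the 2-D discussion in the comments, here is a minimal sketch of that adaptation (the function name and the column-by-column loop are illustrative, assuming each column is an independent series):

from numba import njit
import numpy as np

@njit
def set_at_cum_change_2d(a, step):
    # Same thresholding as above, applied independently down each column.
    out = np.empty(a.shape, dtype=np.float64)
    for j in range(a.shape[1]):
        prev = a[0, j]
        out[0, j] = prev
        for i in range(1, a.shape[0]):
            if np.abs(a[i, j] - prev) > step:
                out[i, j] = a[i, j]        # record the new value
                prev = a[i, j]
            else:
                out[i, j] = out[i - 1, j]  # carry the last recorded value

    return out

m = np.column_stack([data, data[::-1]])  # toy two-column input
out2d = set_at_cum_change_2d(m, 0.5)

This keeps the whole double loop compiled, avoiding any per-column Python overhead and the hidden Python-level loop of `np.apply_along_axis`.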

It's not vectorization, but this solution avoids deepcopy() and various .loc methods, so it should be faster:

data = [0.1, 0.2, 0.3, 0.4 , 0.5, 0.6, 0.7, 0.8, 0.5, 0.2, 0.1, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7, -0.9, -1.2, -0.1, -0.7]

def fn(step):
    # Coroutine: each value sent in is compared against the last
    # recorded value; it yields either the new value (recorded) or
    # the previous one (carried forward).
    current = float('inf')  # sentinel, so the first value is always recorded
    i = yield

    while True:
        if abs(current - i) > step:
            current = i
            i = yield i
        else:
            i = yield current

import pandas as pd

df = pd.DataFrame({'data': data})

f = fn(0.5)
next(f)  # prime the generator so it can start receiving values via send()
df['new_data'] = df['data'].apply(lambda x: f.send(x))

print(df)

Prints:

    data  new_data
0    0.1       0.1
1    0.2       0.1
2    0.3       0.1
3    0.4       0.1
4    0.5       0.1
5    0.6       0.1
6    0.7       0.7
7    0.8       0.7
8    0.5       0.7
9    0.2       0.7
10   0.1       0.1
11  -0.1       0.1
12  -0.2       0.1
13  -0.3       0.1
14  -0.4       0.1
15  -0.5      -0.5
16  -0.6      -0.5
17  -0.7      -0.5
18  -0.9      -0.5
19  -1.2      -1.2
20  -0.1      -0.1
21  -0.7      -0.7