
I have a dataframe like this, with two columns, date and indicator:

date                  indicator 
2019-10-26 06:48:49   -1.073525
2019-10-27 06:19:31   -0.375276
2019-10-28 06:50:44    0.643764
2019-10-29 07:21:35    0.863731
2019-10-30 07:52:36    1.022312
2019-10-31 08:23:18    1.125842
2019-11-01 08:52:35    0.863731
2019-11-02 09:16:28    0.831097
2019-11-03 09:42:20    0.529638
2019-11-04 10:09:01   -0.735926
2019-11-05 10:34:39   -1.743626
2019-11-06 11:00:39   -0.872055

the idea would be to create a signal column, without using a loop, that works like this:

  • if indicator < -1 then:
    • if signal was 0, it becomes 1 and keeps this value until indicator becomes positive
    • if signal was already 1, it doesn't change
  • if indicator > 1 then:
    • if signal was 0, it becomes -1 and keeps this value until indicator becomes negative
    • if signal was already -1, it doesn't change
  • if indicator changes sign:
    • if signal was -1 or 1, it becomes 0
    • if signal was 0, it doesn't change

so it would give something like:

date                  indicator    signal 
2019-10-26 06:48:49   -1.073525      1
2019-10-27 06:19:31   -0.375276      1
2019-10-28 06:50:44    0.643764      0
2019-10-29 07:21:35    0.863731      0 
2019-10-30 07:52:36    1.022312     -1
2019-10-31 08:23:18    1.125842     -1
2019-11-01 08:52:35    0.863731     -1 
2019-11-02 09:16:28    0.831097     -1 
2019-11-03 09:42:20    0.529638     -1
2019-11-04 10:09:01   -0.735926      0  
2019-11-05 10:34:39   -1.743626      1
2019-11-06 11:00:39   -0.872055      1
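
For reference, the rules above can be pinned down as a plain Python loop (shown only to make the specification unambiguous; the goal remains a loop-free version), assuming the signal is 0 before the first row:

```python
def signal_loop(indicator):
    s = 0  # signal state before the first row
    out = []
    for x in indicator:
        if x < -1:
            s = 1
        elif x > 1:
            s = -1
        elif (s == 1 and x > 0) or (s == -1 and x < 0):
            s = 0  # indicator crossed zero while a signal was active
        out.append(s)
    return out

signal_loop([-1.07, -0.37, 0.64, 0.86, 1.02, 1.13,
             0.86, 0.83, 0.53, -0.74, -1.74, -0.87])
# → [1, 1, 0, 0, -1, -1, -1, -1, -1, 0, 1, 1]
```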

I tried to create a column with 1 and -1 depending on the indicator value, then do a diff and a cumulative sum, but I didn't manage to obtain this exact column.

TmSmth
  • `if indicator changes sign, then from 1 or -1 to 0` I didn't quite get this part. Please elaborate on this. – Rishabh Kumar Feb 28 '21 at 04:16
  • Even based on these conditions `if indicator < -1 then 1` and `if indicator > 1 then -1`, the column doesn't always follow these properties. Or maybe I am not getting them. – Rishabh Kumar Feb 28 '21 at 04:23
  • As @Rishabh Kumar mentioned, there are inconsistencies in what you indicated you want to do. Redefine the values for "signal" (especially for the cases of "0") or what you want to achieve. E.g. in the second row, based on what you wrote, "signal" should have a value of 0. – dimi_fn Feb 28 '21 at 05:44
  • Sorry, I've just edited with a clarification, should be good now, thanks! – TmSmth Feb 28 '21 at 12:44
  • Can you share what you have already tried? – d3dave Mar 02 '21 at 08:17
  • your description is incomplete: 1) what if indicator < -1 and signal was -1? 2) what if indicator > 1 and signal was +1? 3) what to do for the very first row? In any case, I am afraid what you describe here cannot be done without a loop (except for the rows where the indicator changes sign, since there the signal is always 0 no matter what the previous signal value was) – Amir Mar 02 '21 at 08:33
  • @Amir the first two situations can't exist, as it comes back to 0 if `indicator` changes sign; for your third point, I consider that the value of `signal` was 0 before the first row – TmSmth Mar 02 '21 at 09:15
  • @d3dave I tried to make a column of `signal.diff()` then a column with `signal.diff().cumsum()`, as I did for another one, but this time it doesn't work – TmSmth Mar 02 '21 at 09:16

4 Answers


A pure numpy solution without using np.vectorize:

indicator_np = df.indicator.to_numpy(copy=True)   # copy, so df.indicator is not mutated in place
indicator_abs_gt1 = np.abs(indicator_np) > 1
np.sign(indicator_np, out=indicator_np)
signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
signal = np.where(
    indicator_abs_gt1 | signchanges,
    -indicator_np * indicator_abs_gt1,
    np.nan
)
mask = np.isnan(signal)                   ##
idx = np.arange(mask.size) * ~mask        ##  Inspired by Divakar's answer -
np.maximum.accumulate(idx, out=idx)       ##  https://stackoverflow.com/a/41191127/5431791
df['signal'] = signal[idx].astype(int)    ##

>>> df
date                  indicator    signal 
2019-10-26 06:48:49   -1.073525      1
2019-10-27 06:19:31   -0.375276      1
2019-10-28 06:50:44    0.643764      0
2019-10-29 07:21:35    0.863731      0 
2019-10-30 07:52:36    1.022312     -1
2019-10-31 08:23:18    1.125842     -1
2019-11-01 08:52:35    0.863731     -1 
2019-11-02 09:16:28    0.831097     -1 
2019-11-03 09:42:20    0.529638     -1
2019-11-04 10:09:01   -0.735926      0  
2019-11-05 10:34:39   -1.743626      1
2019-11-06 11:00:39   -0.872055      1
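
The last four lines (marked ##) are a vectorized forward-fill: non-NaN positions keep their own index, NaN positions collapse to 0, and a running maximum carries the index of the most recent non-NaN value forward. A minimal standalone sketch of the trick (assuming the first element is not NaN):

```python
import numpy as np

def ffill_np(a):
    """Forward-fill NaNs in a 1-D float array (first element must not be NaN)."""
    mask = np.isnan(a)
    idx = np.arange(mask.size) * ~mask   # NaN positions collapse to index 0
    np.maximum.accumulate(idx, out=idx)  # carry the last valid index forward
    return a[idx]

ffill_np(np.array([1.0, np.nan, np.nan, -1.0, np.nan]))
# → array([ 1.,  1.,  1., -1., -1.])
```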

Although the performance improvement over the currently accepted solution is negligible for the sample data in the question, it becomes huge as the data size grows.

SETUP

def pure_np(series):
    indicator_np = series.to_numpy(copy=True)  # copy, so the original series is not mutated
    indicator_abs_gt1 = np.abs(indicator_np) > 1
    np.sign(indicator_np, out=indicator_np)
    signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
    signal = np.where(indicator_abs_gt1 | signchanges, -indicator_np * indicator_abs_gt1, np.nan)
    mask = np.isnan(signal)
    idx = np.arange(mask.size) * ~mask
    np.maximum.accumulate(idx, out=idx)
    return signal[idx].astype(int)

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    elif (s == -1 and x < 0) or (s == 1 and x > 0):
        s = 0
    return s

s = 0  # initial state of the global used by conditions
TmSmth = np.vectorize(conditions)

Benchmark:

>>> df.shape    # sample df
(12, 2)

>>> %timeit TmSmth(df["indicator"])
45.7 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit pure_np(df["indicator"])
39.2 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  ~ 1.1X speed-up

>>> df = pd.concat([df]*1_000, ignore_index=True)
>>> df.shape
(12000, 2)

>>> %timeit TmSmth(df["indicator"])
5.5 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pure_np(df["indicator"])
265 µs ± 5.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   ~ 21X speed-up


>>> df = pd.concat([df]*1_000, ignore_index=True)   # 12 million rows
>>> df.shape
(12000000, 2)

>>> %timeit TmSmth(df['indicator'])
6.43 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pure_np(df['indicator'])
448 ms ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       ~14X speed-up
Sayandip Dutta

I'm not sure why you insist on not doing a loop, but I have a partial solution:

df['signal'] = None
# use .loc to avoid chained-assignment warnings
df.loc[df['indicator'] < -1, 'signal'] = 1
df.loc[df['indicator'] > 1, 'signal'] = -1

loc = df.rolling(window=2).indicator.aggregate(lambda x: x.iloc[0] * x.iloc[1] < 0).fillna(0)
df.loc[loc > 0, 'signal'] = 0

This results in:

    indicator signal
0   -1.073525      1
1   -0.375276   None
2    0.643764      0
3    0.863731   None
4    1.022312     -1
5    1.125842     -1
6    0.863731   None
7    0.831097   None
8    0.529638   None
9   -0.735926      0
10  -1.743626      1
11  -0.872055   None

From here you need to fill each None with the value above it, which I am not sure how to do without looping.
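
One loop-free way to fill those Nones is pandas' built-in forward fill; a minimal sketch on toy data mirroring the partial result above:

```python
import pandas as pd

# toy stand-in for the partially filled signal column
sig = pd.Series([1, None, 0, None, -1, None])
filled = sig.ffill()  # each gap takes the most recent non-null value above it
print(filled.tolist())   # → [1.0, 1.0, 0.0, 0.0, -1.0, -1.0]
```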

Amir
  • Could you explain the idea of the loc please? And I prefer not doing a loop for performance reasons; I've always heard that with big dataframes you can lose a lot of time looping and should prefer vectorisation – TmSmth Mar 02 '21 at 13:17
  • Edit: I got the loc idea; I'm waiting for a complete solution if possible for now, thanks – TmSmth Mar 02 '21 at 13:29

I've found something that works, even if it can probably be optimized. It's based on @user4340135's answer here: Numpy "where" with multiple conditions. I add a global variable to retain the last value:

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    elif (s == -1 and x < 0) or (s == 1 and x > 0):
        s = 0
    return s


s = 0  # value of the signal before the first row
func = np.vectorize(conditions)
df['signal'] = func(df["indicator"])
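
A quick self-contained check of this approach on a toy series (values chosen to exercise the sign-change rule; note that `np.vectorize` evaluates the first element an extra time to infer the output type, which is harmless here because the update is idempotent, but can matter for stateful functions):

```python
import numpy as np

s = 0  # global state: the current signal value

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    elif (s == -1 and x < 0) or (s == 1 and x > 0):
        s = 0
    return s

func = np.vectorize(conditions)
# crosses -1, returns inside (-1, 1), then exceeds 1
print(func([-2.0, 0.5, 3.0, 3.0]).tolist())   # → [1, 0, -1, -1]
```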
TmSmth
  • I guess this almost works, but you first need to check if value changed signs and only then check if abs>1, since if it was negative and changed to +2 signal should return 0. – Amir Mar 03 '21 at 10:57
  • I don't understand? The `if` after the `else` checks for the change of sign. For example, if `s = -1` and `x` is negative, it means the indicator changed sign, as `s = -1` only if `x > 0` and it was above 1 at some point. – TmSmth Mar 03 '21 at 11:06
  • if indicator is: `[-2, 3, 3]` according to your explanation the signal should become `[1, 0, -1]` since it changed signs from -2 to 3, while in this solution it becomes `[1, -1, -1]` – Amir Mar 03 '21 at 12:09
  • No, it's OK: if it's 3 it changed sign but it's also above 1, so the signal becomes -1. The sign check is only for an indicator strictly between -1 and 1; sorry if that was not clear – TmSmth Mar 03 '21 at 14:12
  • Please see the comment you gave me on the original question (not answer), I asked you specifically for this case and you answered: "@Amir the first two situations can't exist as it come back to 0 if indicator changes sign" – Amir Mar 04 '21 at 09:11

numpy.where() can be used with two or even more conditions, as follows:

numpy.where((condition1) | (condition2)) for "or" problems

numpy.where((condition1) & (condition2)) for "and" problems

and the output of numpy.where() can be used like this: df.iloc[out_put_of_numpy_where]. Although I didn't quite understand the problem, this should solve it.
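
A minimal, self-contained sketch of this pattern on toy data (the column name and values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'indicator': [-1.5, -0.3, 0.6, 1.2]})

# positions where the indicator is outside (-1, 1)
rows = np.where((df['indicator'] < -1) | (df['indicator'] > 1))[0]
print(rows.tolist())   # → [0, 3]

print(df.iloc[rows])   # the matching rows of the dataframe
```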

Sadegh