Labeling whether the numbers in a dataframe go up first or down first

Question

Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B trends down first or up first compared to the value at [i, 'A'].

Here is a loop:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 5, 0, 0, 0, 0, 0],
                   'B': [1, 10, -10, 2, 3, 0, 0, 0, 0, 0],
                   'label': [0] * 10})

for i in range(0, 5):
    j = i
    # Scan rows i..i+4; once row i is classified, no need to continue
    while j < i + 5 and df.at[i, 'label'] == 0:
        diff = df.at[j, 'B'] - df.at[i, 'A']
        if diff >= 10:
            df.at[i, 'label'] = 1  # Label 1 means trending up
        elif diff <= -10:
            df.at[i, 'label'] = 2  # Label 2 means trending down
        j = j + 1


    [out]
    A   B    label
    0   1    1
    1   10   2
    2   -10  2
    3   2    0
    5   3    0

    ...

The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)

What is a fast way to do this? Ideally without a loop.

You could probably use `.shift()` to create a column that you could then use `diff` on. Then simply have it check those 2 columns to classify. Also, don't use `class` as a variable — chitown88, Aug 08 '22 at 11:33
Code sample has several errors: 1) label undefined, 2) j undefined, 3) j = j+1 should be within while loop, 4) B is undefined, 5) A is undefined. — DarrylG, Aug 08 '22 at 12:35
What is `n`? Are you just wanting a label of `0`, `1`, or `2`, based on the output of `B-A`? That feels very different than the question you asked, but your code suggests this is what you are trying to do. — JNevill, Aug 08 '22 at 12:51

DarrylG · Accepted Answer · 2022-08-08T19:23:02.780

Looping over a DataFrame row by row is slow compared to using pandas' vectorized methods.

The task can be accomplished using Pandas vectorized methods:

`rolling` method, which does computations in a rolling window
`min` & `max` methods (plus `argmin` & `argmax`), which we compute in the rolling window
`np.select` from NumPy, which allows us to set values based upon a list of conditions
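
As a quick illustration of the forward-looking trick used in the code below, here is a minimal sketch on a toy Series (the names `s`, `window` and `fwd_max` are made up for this example). Rolling windows normally look backward, so reversing, rolling, and reversing again makes each row see itself plus the rows that follow it.

```python
import pandas as pd

s = pd.Series([1, 10, -10, 2, 3])
window = 2

# Reverse the Series, take a backward-looking rolling max, then reverse back.
# Each position now holds the max of itself and the next (window - 1) values.
fwd_max = s[::-1].rolling(window, min_periods=1).max()[::-1]

print(fwd_max.tolist())  # [10.0, 10.0, 2.0, 3.0, 3.0]
```

The last row has no rows after it, so `min_periods=1` lets it fall back to a window containing only itself.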

Code

import numpy as np
import pandas as pd

def set_trend(df, threshold = 10, window_size = 2):
    '''
        Use rolling_window to find max/min values in a window from the current point
    
        rolling window normally looks at backward values
    
        We use technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
        to look at forward values
    '''
    # To have a rolling window on lookahead values in column B
    #    We reverse values in column B
    df['B_rev'] = df["B"].values[::-1]
    #    Max & Min in B_rev, then reverse order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods = 0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods = 0).min().values[::-1]
    nrows = df.shape[0] - 1     # offset for argmax & argmin positions since rows are in reverse order
                                # i.e. nrows - x.argmax() is smaller when the max occurs earlier in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmax(), raw = True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmin(), raw = True).values[::-1]
           
    # Use np.select to implement label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']), # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']), # min below & comes first
        df['max_'] - df["A"] >= threshold,        # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,       # min below threshold but didn't come first
    ]
    choices = [
        1, # max above & came first
        2, # min below & came first
        1, # max above threshold
        2, # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default = 0)
    
    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis = 1, inplace = True)

    return df

Tests

Case 1

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))

Case 2

df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))

Output

Case 1

    A   B   label
0   0   1   1
1   1   10  2
2   2   -10 2
3   3   2   0
4   5   3   0
5   0   0   0
6   0   0   0
7   0   0   0
8   0   0   0
9   0   0   0

Case 2

    A   B   label
0   0   1   2
1   1   -10 2
2   2   10  0
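
One caveat that may be worth checking on real data: `set_trend` compares where the window maximum and minimum occur, which is not always the same as which threshold is crossed first. For example, with A = 0 and B = [10, -30, 50], the loop in the question stops at the first value (label 1), while the minimum comes before the maximum in that window. The following is a minimal sketch of an alternative that follows the loop's "first crossing wins" rule. The function name `label_first_crossing` and its defaults are assumptions for illustration, and unlike the question's loop it labels every row and shortens the window at the end of the frame.

```python
import numpy as np
import pandas as pd

def label_first_crossing(df, threshold=10, window_size=5):
    """Hypothetical helper: label rows by the first threshold crossing in B."""
    a = df["A"].to_numpy()
    b = df["B"].to_numpy()
    n = len(df)
    label = np.zeros(n, dtype=np.int8)

    # Loop over the few lookahead offsets, not over the rows
    for k in range(min(window_size, n)):
        m = n - k                    # rows that still have a value k steps ahead
        diff = b[k:] - a[:m]         # B[i + k] - A[i] for i in 0..m-1
        view = label[:m]             # a view, so writes go into `label`
        unlabeled = view == 0        # earlier offsets keep priority
        view[unlabeled & (diff >= threshold)] = 1    # trending up first
        view[unlabeled & (diff <= -threshold)] = 2   # trending down first

    return df.assign(label=label)

df = pd.DataFrame({'A': [0, 1, 2], 'B': [1, -10, 10]})
print(label_first_crossing(df)["label"].tolist())  # [2, 2, 0]
```

The Python-level loop runs only `window_size` times, and each pass is a vectorized operation over the whole column, so the cost should scale with the number of rows times the window size rather than with per-row `df.at` lookups.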

Very neat! I wonder if I could learn more. Consider the labeling of the first row, A=0. Suppose the window size is four, the max of B is 10 and the min of B is -10, so both thresholds are met. In the loop code (main question), the trend is determined by which threshold is met first. I think your code also accomplishes this ordering, but I am not sure how it does it. I think I can make the df['max'] into a two-dimensional dataframe with an index added. In the final step I can compare the indexes of df['max'] and df['min'] — High GPA, Aug 08 '22 at 17:48
For example, let df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]}). The label at 0 should be 2. But the code gives 1 as output. — High GPA, Aug 08 '22 at 18:17
@HighGPA -- I updated my answer to take into account which comes first the min or max in a rolling window. — DarrylG, Aug 08 '22 at 19:24
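
To make the "which comes first" comparison concrete, here is a small sketch on a single window, using the B values from the example in the comments. The variable names are made up for illustration. The answer subtracts from `nrows` rather than from the window length, but since both positions are shifted by the same constant, the comparison between them should come out the same.

```python
import numpy as np

window_forward = np.array([1, -10, 10])   # B at rows i, i+1, i+2
rev = window_forward[::-1]                # what the rolling window on B_rev sees

# In the reversed window, a larger argmax/argmin means "earlier" in the
# original order, so subtracting flips it back to a forward position.
max_pos = len(rev) - 1 - rev.argmax()
min_pos = len(rev) - 1 - rev.argmin()

# With A = 0 and threshold 10 both extremes qualify; the earlier one decides
label = 1 if max_pos <= min_pos else 2
print(max_pos, min_pos, label)  # 2 1 2
```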
