0

Copy the following dataframe to your clipboard (select fully):

   Time    Group1     Value  Group2
0    05    A           0.91   0.5/1
1    05    A           0.88   0.5/1
2    05    A           0.82   0.5/1
3    06    A           0.79   0.5/1
4    08    A           0.74   0.5/1
5    08    A           0.71   1
6    09    A           0.70   0.5/1
7    09    A           0.81   0.5/1
8    10    A           0.86   0.5/1
9    11    A           0.83   0.5/1
11   13    B           0.80   0.5/1
12   13    B           0.71   0.5/1
13   14    B           0.87   0.5/1
14   14    B           0.92   0.5/1
15   14    B           0.97   0.5/1
16   15    B           1.07   0.5/1
17   15    B           0.76     0.5
18   15    B           1.03   0.5/1
19   15    B            NaN     NaN
20   16    C           0.81     0.5
21   16    C           0.82     0.5
22   16    C           0.85     0.5
23   17    C           0.95     0.5
24   18    C           0.90     0.5
25   18    C           0.87     0.5
26   18    C           0.82       1
27   18    C           0.86     0.5
28   18    C           0.83     0.5
29   19    C           1.04   0.5/1
30   19    C           0.99   0.5/1
31   20    C           0.95   0.5/1
32   20    C           0.92   0.5/1
33   22    C           0.89   0.5/1
34   22    C           0.92   0.5/1
35   23    C           0.90   0.5/1
36   23    C           0.93   0.5/1

Now use

import pandas as pd
df=pd.read_clipboard(sep='\s\s+')

to load it into your environment.

ALGORITHMN / QUESTION How to find every sequence of declining values in column Value, which has at least N declining values and a total decline of >= DELTA. Multiple such sequences may exist in the Dataframe and shall be output. A sequence is interrupted, if the value in either Group1 / Group2 column changes. If the value in Values does not decline but stays the same, this is not counting towards reaching N rows, but does neither interrupt the sequence.

DESIRED OUTPUT: Desired output shall be a list of sequences, for every sequence I want to know the start-time, the end-time, the start-value in check and the end-value in check, and also the length, i.e. the number of declining steps.

EXAMPLES FOR THE DATAFRAME: For N = 4 and DELTA = 0.1, I would find: Time 05 with value 0.91 until Time 08 with Value 0.74. The sequence ends because the next row contains a change in Group2, even though the value would decline further. This sequence has 4 steps and a total delta of 0.91-0.74 = 0.17.

The second sequence goes from Time 19 (value 1.04) until Time 22 (value 0.89), 4 declines and a total Delta of 0.15.

SOLUTION: Here are my thoughts:

(1) Combine both Groups to a new column which could be checked for sameness against previous row (similar to this one -> Comparing previous row values in Pandas DataFrame)

df['combined'] = df['Group1'] + '___' + df['Group2']
df['same'] = df['combined'].eq(df['combined'].shift())

(2) Find whether the column Value is declining, calculate difference:

df['decline'] = df['Value'].diff().fillna(0.0)

(3) Delete rows which have df['decline'] == 0, i.e. just a same value, because this wont count towards reaching threshold N, so not required:

df = df[df['decline'] != 0]

(4) Combine (1) and (3), i.e. check for declining values if they are not interrupted by a change of value in Group1 or Group2:

df['is_sequence'] = (df['decline'] < 0) & df['same'] # Still fails, have to check how to make it correct...

And now check the length of the total delta of the sequences...

Any ideas?

EDIT says: Maybe some functions like .rolling() or grouping might help!? I found a similar question here about finding sequences: Python Pandas: Find a pattern in a DataFrame which uses a for-loop. You do not need to suggest something like this, because I already know how to do it in a loop. I'm asking here for a more pandas/pythonic solution if possible

tim
  • 9,896
  • 20
  • 81
  • 137

1 Answers1

2
Algorithm
  1. Create groups where "Group1" and "Group2" are identical in consecutive rows.
  2. Call custom function that returns the data from the longest streak of declines.
    • groupby to calculate consecutive streaks of declines
    • return longest streak if it is longer than N
Code
def streaks(x, N=4):
    declines = x["Value"].diff().fillna(0)<=0
    streaks = x.groupby(declines.ne(declines.shift()).cumsum())["Value"].transform("count")
    
    longest = streaks[streaks>N].max()-1
    indices = streaks[streaks==streaks[streaks>N].max()].index
    if len(indices)==0:
        return None
    
    result = pd.Series({"Start time": x.at[indices[0], "Time"],
                        "End time": x.at[indices[-1], "Time"],
                        "Start value": x.at[indices[0], "Value"],
                        "End value": x.at[indices[-1], "Value"],
                        "Steps": longest,
                        "Delta": x.at[indices[0], "Value"]-x.at[indices[-1], "Value"]
                        })
    return result

groups = ((df["Group1"].ne(df["Group1"].shift()))|(df["Group2"].ne(df["Group2"].shift()))).cumsum()
output = df.groupby(groups).apply(streaks).dropna(how="all").reset_index(drop=True)
Output
>>> output
   Start time  End time  Start value  End value  Steps  Delta
0         5.0       8.0         0.91       0.74    4.0   0.17
1        19.0      22.0         1.04       0.89    4.0   0.15
not_speshal
  • 22,093
  • 2
  • 15
  • 30