Pandas: Holding on to an output until a change happens in parameter X

Question

I am trying to identify different phases in a process. What I basically need to create is the following:

When Parameter A > certain value: Output = Phase 1; keep this value until:
Parameter B reaches a certain value, then Output = Phase 2

This is of course quite easy to program with a generator. However, the tricky part is that the process can sometimes go back from phase 2 to phase 1, or it can skip a phase.

I am not quite sure how to do this. Ideally the code would look at a parameter, and when it changes decide to go back or forward in the phases.

I came up with some sample code below:

Give an output for Phase 1 when Parameter A reaches 1.
Hold on to Phase 1 until Parameter B changes to >120 or parameter A >= 2.
Hold on until parameter A < 1.5 --> go back to Phase 1 or Hold on until parameter A > 3 --> go forward to Phase 3.

I hope this question is clear. The real dataset has 36 parameters so I simplified the case a bit to not make it any more complicated than necessary!

I hope you can help me out!

import pandas as pd

data = {
  "Date and Time": ["2020-06-07 00:00", "2020-06-07 00:01", "2020-06-07 00:02", "2020-06-07 00:03", "2020-06-07 00:04", "2020-06-07 00:05", "2020-06-07 00:06", "2020-06-07 00:07", "2020-06-07 00:08", "2020-06-07 00:09", "2020-06-07 00:10"],
  "Parameter A": [1, 1, 1, 1, 1.5, 2, 2.1, 2.2, 2.3, 1.6, 1.2],
  "Parameter B": [100, 101, 99, 102, 101, 105, 120, 125, 122, 123, 99],
  "Required output": ["Phase 1", "Phase 1","Phase 1","Phase 1","Phase 1","Phase 2","Phase 2","Phase 2","Phase 2","Phase 2","Phase 1"]
}

df = pd.DataFrame(data)

why does it go to 'Phase 2' when `a==2` and `b==105`? To make your question clearer, I would: 1. add a state machine representation, or some code that captures the logic "with a generator", since you say that is easy to do. 2. give a better example with more transitions, and some comments about why there is a transition. — Pierre D, May 12 '21 at 12:44
Ah my bad! It should move up when either parameter A >= 2 or parameter B >120. I edited the original post. — Mel, May 12 '21 at 12:47
There is, but the part where a<1 is not part of the analysis and filtered out beforehand. — Mel, May 12 '21 at 12:54

Pierre D · Accepted Answer · 2021-05-12T14:56:23.093

The basic problem you are trying to solve is to implement hysteresis (i.e. the state depends on history, not only on the current values).

Aside from that, the logic to capture intervals of a and b can be expressed using pd.cut().

import numpy as np

a = df['Parameter A']
b = df['Parameter B']

cat_a = pd.cut(a, [-np.inf, 1, 1.5, 2, 3, np.inf], labels=[0,1,1.5,2,3], right=False)
cat_b = pd.cut(b, [-np.inf, 120, np.inf], labels=[0,2], right=False)

For cat_a, we have a bin (labeled 1.5) that corresponds to the "uncertain" zone between 1.5 and 2, where the hysteresis takes place (in that area, if the previous phase was >= 2, use 2, otherwise use 1).
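To make the bin edges concrete, here is a small sketch that runs the same pd.cut() calls on a few hand-picked values (the sample values and the names levels_a, levels_b and levels_b_strict are only for illustration). With right=False each bin is closed on the left, so a == 1.5 already falls in the "uncertain" bin and b == 120 already maps to 2. If the intent is strictly b > 120, as in the comments on the question, the default right=True with the same edges should give that behaviour.

```python
import numpy as np
import pandas as pd

# Illustrative sample values, chosen to sit on and around the bin edges.
a = pd.Series([0.5, 1.0, 1.4, 1.5, 1.9, 2.0, 3.5])
b = pd.Series([119, 120, 121])

# Same calls as in the answer: left-closed bins, numeric labels.
cat_a = pd.cut(a, [-np.inf, 1, 1.5, 2, 3, np.inf], labels=[0, 1, 1.5, 2, 3], right=False)
cat_b = pd.cut(b, [-np.inf, 120, np.inf], labels=[0, 2], right=False)

# Variant with the default right=True: bins are (-inf, 120] and (120, inf).
cat_b_strict = pd.cut(b, [-np.inf, 120, np.inf], labels=[0, 2])

# Convert the categorical labels to plain floats for easy inspection.
levels_a = cat_a.astype(float).tolist()
levels_b = cat_b.astype(float).tolist()
levels_b_strict = cat_b_strict.astype(float).tolist()

print(levels_a)
print(levels_b)
print(levels_b_strict)
```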

We take the row-wise max of cat_a and cat_b to establish a history-independent (tmp) value:

tmp = pd.concat([cat_a, cat_b], axis=1).max(axis=1)
>>> df.assign(tmp=tmp)
       Date and Time  Parameter A  Parameter B Required output  tmp
0   2020-06-07 00:00          1.0          100         Phase 1  1.0
1   2020-06-07 00:01          1.0          101         Phase 1  1.0
2   2020-06-07 00:02          1.0           99         Phase 1  1.0
3   2020-06-07 00:03          1.0          102         Phase 1  1.0
4   2020-06-07 00:04          1.5          101         Phase 1  1.5
5   2020-06-07 00:05          2.0          105         Phase 2  2.0
6   2020-06-07 00:06          2.1          120         Phase 2  2.0
7   2020-06-07 00:07          2.2          125         Phase 2  2.0
8   2020-06-07 00:08          2.3          122         Phase 2  2.0
9   2020-06-07 00:09          1.6          123         Phase 2  2.0
10  2020-06-07 00:10          1.2           99         Phase 1  1.0

Now, to implement the hysteresis, we use a function from this SO answer, which relies on numpy. It is slightly adapted so that the hysteresis band includes its left edge (th_lo <= x < th_hi):

import numpy as np


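def hyst(x, th_lo, th_hi, initial=False):
    hi = x >= th_hi                # samples at or above the upper threshold
    lo_or_hi = (x < th_lo) | hi    # samples outside the hysteresis band
    ind = np.nonzero(lo_or_hi)[0]  # positions of those decisive samples
    if not ind.size:  # prevent index error if ind is empty
        return np.zeros_like(x, dtype=bool) | initial
    cnt = np.cumsum(lo_or_hi)      # number of decisive samples seen so far
    # state of the most recent decisive sample, or `initial` before the first one
    return np.where(cnt, hi[ind[cnt - 1]], initial)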

This returns a boolean array that indicates, for each row, whether the phase should be "high" (True) or "low" (False). We then replace the uncertain values (1.5) with 1 or 2 depending on the hysteresis. Finally, we convert the numerical phase into a string label:

phase = tmp.where(tmp != 1.5, np.where(hyst(tmp.values, 1.5, 2), 2, 1))
df = df.assign(phase='Phase ' + phase.astype(int).astype(str))
>>> df
       Date and Time  Parameter A  Parameter B Required output    phase
0   2020-06-07 00:00          1.0          100         Phase 1  Phase 1
1   2020-06-07 00:01          1.0          101         Phase 1  Phase 1
2   2020-06-07 00:02          1.0           99         Phase 1  Phase 1
3   2020-06-07 00:03          1.0          102         Phase 1  Phase 1
4   2020-06-07 00:04          1.5          101         Phase 1  Phase 1
5   2020-06-07 00:05          2.0          105         Phase 2  Phase 2
6   2020-06-07 00:06          2.1          120         Phase 2  Phase 2
7   2020-06-07 00:07          2.2          125         Phase 2  Phase 2
8   2020-06-07 00:08          2.3          122         Phase 2  Phase 2
9   2020-06-07 00:09          1.6          123         Phase 2  Phase 2
10  2020-06-07 00:10          1.2           99         Phase 1  Phase 1
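To see what hyst() does on its own, it can help to compare it with a plain Python loop that carries the state forward one sample at a time. The sketch below repeats hyst() so that it is self-contained; hyst_loop and the sample array x are made up for this illustration and are not part of the original solution.

```python
import numpy as np


def hyst(x, th_lo, th_hi, initial=False):
    # Vectorized version, same logic as in the answer.
    hi = x >= th_hi
    lo_or_hi = (x < th_lo) | hi
    ind = np.nonzero(lo_or_hi)[0]
    if not ind.size:
        return np.zeros_like(x, dtype=bool) | initial
    cnt = np.cumsum(lo_or_hi)
    return np.where(cnt, hi[ind[cnt - 1]], initial)


def hyst_loop(x, th_lo, th_hi, initial=False):
    # Reference version (hypothetical helper): explicit state, one sample at a time.
    state = initial
    out = []
    for v in x:
        if v >= th_hi:
            state = True     # at or above the upper threshold: switch to high
        elif v < th_lo:
            state = False    # below the lower threshold: switch to low
        # otherwise we are inside the band and keep the previous state
        out.append(state)
    return np.array(out, dtype=bool)


# Illustrative input: 1.5 sits inside the band [1.5, 2).
x = np.array([1.0, 1.5, 2.0, 1.5, 1.5, 1.0, 1.5, 3.0, 1.5])
vectorized = hyst(x, 1.5, 2)
looped = hyst_loop(x, 1.5, 2)
print(vectorized)
print(looped)
```

Both versions should agree; the loop is easier to read, while the vectorized one avoids a Python-level loop over long series.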

In summary

The full code is (in addition to the hyst() function above):

a = df['Parameter A']
b = df['Parameter B']

cat_a = pd.cut(a, [-np.inf, 1, 1.5, 2, 3, np.inf], labels=[0,1,1.5,2,3], right=False)
cat_b = pd.cut(b, [-np.inf, 120, np.inf], labels=[0,2], right=False)
tmp = pd.concat([cat_a, cat_b], axis=1).max(axis=1)
phase = tmp.where(tmp != 1.5, np.where(hyst(tmp.values, 1.5, 2), 2, 1))
df = df.assign(tmp=tmp, phase='Phase ' + phase.astype(int).astype(str))

Hopefully, you can adapt and extend this logic for your 36-parameter case.
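If the rules become too intricate to express with pd.cut() and max, one possible alternative is an explicit row-by-row state machine. The following is only a sketch under the same thresholds as above (b >= 120 treated as level 2, hysteresis band 1.5 <= a < 2); the function step and the variable high are hypothetical names, and a Python loop like this will be slower than the vectorized approach on large frames.

```python
import pandas as pd


def step(a, b, high):
    """Return (phase, high) for one row, given the previous 'high' state."""
    # History-independent level from Parameter A (1.5 marks the uncertain band).
    if a >= 3:
        level = 3
    elif a >= 2:
        level = 2
    elif a >= 1.5:
        level = 1.5
    elif a >= 1:
        level = 1
    else:
        level = 0
    # Parameter B can lift the level to 2 (inclusive threshold, as in cat_b).
    if b >= 120:
        level = max(level, 2)
    if level >= 2:
        return int(level), True           # clearly high
    if level == 1.5:
        return (2 if high else 1), high   # inside the band: keep previous state
    return int(level), False              # clearly low


df = pd.DataFrame({
    "Parameter A": [1, 1, 1, 1, 1.5, 2, 2.1, 2.2, 2.3, 1.6, 1.2],
    "Parameter B": [100, 101, 99, 102, 101, 105, 120, 125, 122, 123, 99],
})

high = False  # assumed initial state: low
phases = []
for a, b in zip(df["Parameter A"], df["Parameter B"]):
    phase, high = step(a, b, high)
    phases.append(f"Phase {phase}")

df["phase"] = phases
print(df)
```

Each additional parameter would then become another branch or helper inside step, which may be easier to maintain than a long chain of categoricals.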

Another example

To better illustrate the phase transitions and the logic, here is another example:

df = pd.DataFrame([
    [0, 0],
    [1, 100],
    [1.2, 100],
    [1.5, 100],
    [1.6, 100],
    [2, 100],
    [2.1, 100],
    [1.6, 100],
    [1.5, 100],
    [1.4, 100],
    [1.4, 120],
    [1.5, 100],
    [3, 100],
    [1.5, 100],
    [1.6, 100],
    [1.4, 100],
], columns=['Parameter A', 'Parameter B'])

Running the code above, and adding tmp to the df for inspection, we see (with comments added by hand):

>>> df.assign(tmp=tmp, phase='Phase ' + phase.astype(int).astype(str))
    Parameter A  Parameter B  tmp    phase
0           0.0            0  0.0  Phase 0
1           1.0          100  1.0  Phase 1
2           1.2          100  1.0  Phase 1
3           1.5          100  1.5  Phase 1  # in hyst., but prev was low
4           1.6          100  1.5  Phase 1
5           2.0          100  2.0  Phase 2
6           2.1          100  2.0  Phase 2
7           1.6          100  1.5  Phase 2  # in hyst. but prev was high
8           1.5          100  1.5  Phase 2
9           1.4          100  1.0  Phase 1
10          1.4          120  2.0  Phase 2  # goes to 2 bc b >= 120
11          1.5          100  1.5  Phase 2
12          3.0          100  3.0  Phase 3
13          1.5          100  1.5  Phase 2  # note: not 3, even though prev was 3
14          1.6          100  1.5  Phase 2
15          1.4          100  1.0  Phase 1
