
I have a pandas DataFrame df which I queried using a for loop that captures events based on the dummy variable in the Status column. The logic is: when Status changes from 0 to 1, the timestamp at the 1 marks the start of an event; when Status changes from 1 to 0, the timestamp at the 0 marks the end of the event.

df looks like this:

        ID  Timestamp               Value   Status
103177  64  2010-09-21 23:13:21.090 21.5    1.0
252019  64  2010-09-22 00:44:14.890 21.5    1.0
271381  64  2010-09-22 00:44:15.890 21.5    0.0
268939  64  2010-09-22 00:44:17.890 23.0    0.0
259875  64  2010-09-22 00:44:18.440 23.0    1.0
18870   64  2010-09-22 00:44:19.890 24.5    1.0
205910  64  2010-09-22 00:44:23.440 24.5    0.0
103865  64  2010-09-22 01:04:33.440 23.5    0.0
152281  64  2010-09-22 01:27:01.790 22.5    0.0
138988  64  2010-09-22 02:18:52.850 21.5    0.0
             ...

reproducible example:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'ID': {103177: 64,
  252019: 64,
  271381: 64,
  268939: 64,
  259875: 64,
  18870: 64,
  205910: 64,
  103865: 64,
  152281: 64,
  138988: 64},
 'Timestamp': {103177: Timestamp('2010-09-21 23:13:21.090000'),
  252019: Timestamp('2010-09-22 00:44:14.890000'),
  271381: Timestamp('2010-09-22 00:44:15.890000'),
  268939: Timestamp('2010-09-22 00:44:17.890000'),
  259875: Timestamp('2010-09-22 00:44:18.440000'),
  18870: Timestamp('2010-09-22 00:44:19.890000'),
  205910: Timestamp('2010-09-22 00:44:23.440000'),
  103865: Timestamp('2010-09-22 01:04:33.440000'),
  152281: Timestamp('2010-09-22 01:27:01.790000'),
  138988: Timestamp('2010-09-22 02:18:52.850000')},
 'Value': {103177: 21.5,
  252019: 21.5,
  271381: 21.5,
  268939: 23.0,
  259875: 23.0,
  18870: 24.5,
  205910: 24.5,
  103865: 23.5,
  152281: 22.5,
  138988: 21.5},
 'Status': {103177: 1.0,
  252019: 1.0,
  271381: 0.0,
  268939: 0.0,
  259875: 1.0,
  18870: 1.0,
  205910: 0.0,
  103865: 0.0,
  152281: 0.0,
  138988: 0.0}})

df

The for-loop query:

def my_func1(df):

    # positional access below needs a 0..n-1 index, not the original labels
    df = df.reset_index(drop=True)

    dataframe = []
    state = 0

    for i in range(1, len(df.index)):

        if state == 0:

            if df.loc[i, 'Status'] > 0:

                Start_time = df.loc[i, 'Timestamp']
                StartValue = df.loc[i, 'Value']
                StartStatus = df.loc[i, 'Status']
                ID = df.loc[i, 'ID']

                state = 1

        elif state == 1:

            if df.loc[i, 'Status'] == 0:

                End_time = df.loc[i, 'Timestamp']
                End_Value = df.loc[i, 'Value']
                End_Status = df.loc[i, 'Status']

                state = 0

                dataframe.append([ID, Start_time, StartValue, StartStatus, End_time, End_Value, End_Status])

    # assemble the output DataFrame
    output_table = pd.DataFrame(dataframe, columns=["ID", "StartTime", "StartValue", "StartStatus", "EndTime", "EndValue", "EndStatus"])

    return output_table

The output table format looks like this (just an example, not the same data as above):

    ID  StartTime                StartValue StartStatus EndTime                 EndValue EndStatus
0   64  2010-09-22 00:44:14.890  21.5       1.0         2010-09-22 00:44:15.890 21.5     0.0
1   64  2010-09-22 00:44:18.440  23.0       1.0         2010-09-22 00:44:23.440 24.5     0.0
2   64  2010-09-23 00:06:48.970  100.0      1.0         2010-09-23 00:06:52.970 100.0    0.0
3   64  2010-09-23 00:07:14.970  100.0      1.0         2010-09-23 00:07:23.970 100.0    0.0
4   64  2010-09-23 02:07:52.990  100.0      1.0         2010-09-23 02:08:13.990 100.0    0.0

Now I want to look into each of the events inside the original DataFrame df using each row in output_table, where each event can be found by filtering Timestamp between StartTime and EndTime. How can I apply another self-defined function my_func2 to the data of each of these individual events in df? (my_func2 is supposed to create a label in output_table that indicates whether each event satisfies some condition on the Value column.)


Edit:

To be more specific, my_func2 measures continuous time intervals where Value stays above 30 for at least 10 seconds. Then I want to find the overlap of my_func1 and my_func2, and eventually calculate the median value in this time interval for each event.

I guess I could work out the overlap by comparing the timestamps of the start and end points of the two events (calculated by the two functions). What I need help with is how to find the original data in df for calculating the median value for each individual event. Any idea is appreciated.
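
To make the slicing part concrete, something like this sketch is what I picture (median_per_event and the EventMedian column are just placeholder names I made up; my_func2's own threshold check is left out):

```python
import pandas as pd

# Sketch only: slice df for each event found by my_func1 and attach a
# per-event median of Value to output_table.
def median_per_event(df, output_table):
    medians = []
    for _, ev in output_table.iterrows():
        # inclusive filter on the event's time window
        mask = df["Timestamp"].between(ev["StartTime"], ev["EndTime"])
        medians.append(df.loc[mask, "Value"].median())
    return output_table.assign(EventMedian=medians)
```

Each event's rows are selected with an inclusive `between` on Timestamp, as described above.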

nilsinelabore
  • Can you provide the data in a more convenient format? You can see [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391) for more information. – AMC Oct 26 '20 at 01:24
  • 1
    Please state the requirements of the actual solution and exemplify directly. Avoid the [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem#:~:text=What%20is%20it%3F,trouble%2C%20you%20ask%20about%20Y.) because your question need not be solved with your proposed scheme. – Bill Huang Oct 26 '20 at 02:21
  • Hi @AMC please see edited question. – nilsinelabore Oct 26 '20 at 02:22
  • About the statement "0 -> 1 means start" -- which one record marks the start? And does your "between" include the starting row? Same problem happens for the end. Please exemplify or at least provide explicit and clear rules. – Bill Huang Oct 26 '20 at 02:41
  • Hi @BillHuang it marks the start at `Status` turning 1 (from 0), and it marks the end as `Status` turns 0 (from 1). Yes 'between' is inclusive of start and end values – nilsinelabore Oct 26 '20 at 02:42
  • @BillHuang Thanks for your feedback, please see edited question. – nilsinelabore Oct 26 '20 at 02:48
  • If the initial status is `1`, should this streak of 1's be counted as a part of an event? – Bill Huang Oct 26 '20 at 02:57
  • No. The initial status is 1 in your example. Could you explain? This is crucial information. At least in my implementation, such a boundary condition does matter. – Bill Huang Oct 26 '20 at 03:16
  • @BillHuang Yes, sorry for the confusion, the streak of 1's should be counted as a part of an event at the beginning. – nilsinelabore Oct 26 '20 at 03:22

1 Answer


You don't have to extract the rows after the intervals are found. The intervals, as well as the rows in between, can be (and for efficiency should be) extracted at the same time.

Solution

The key to efficient extraction is to capture each event streak with the diff-cumsum trick. Your case is a little special because the element immediately after a streak also counts as part of the streak; this is reflected in the definition of the flag is_ev in the code below.

After the diff-cumsum trick is performed, event streaks and non-event streaks interleave. The right modulus against 2 is determined by the initial Status.

df["diff"] = df["Status"].diff()
df["is_start"] = df["diff"] == 1                  # 0 -> 1 transition
df["is_end"] = df["diff"] == -1                   # 1 -> 0 transition
df["is_ev"] = (df["Status"] == 1) | df["is_end"]  # event rows include the closing 0
df["ev_number"] = df["is_ev"].diff().cumsum()     # label each streak of equal is_ev
df["ev_number"].iat[0] = 0                        # first diff is NaN; start numbering at 0

# output 1: interval marks
df_start_end = df[df["is_start"] | df["is_end"]]

# output 2: records between (inclusive) events in chronological ordering
# - different events can be subset using
#   df_records[df_records["ev_number"] == N]
# - Why set the modulus:
#   If df begins with Status = 1, event records will have even df["ev_number"], namely modulus = 0 against divisor 2
#   If df begins with Status = 0, event records will have odd df["ev_number"], so modulus = 1
modulus = 0 if df["Status"].iat[0] == 1 else 1
df_records = df[df["ev_number"] % 2 == modulus]

Result

print(df_start_end.drop(columns=["diff", "is_ev"]))

        ID               Timestamp  Value  Status  is_start  is_end ev_number
271381  64 2010-09-22 00:44:15.890   21.5     0.0     False    True         0
259875  64 2010-09-22 00:44:18.440   23.0     1.0      True   False         2
205910  64 2010-09-22 00:44:23.440   24.5     0.0     False    True         2

print(df_records.drop(columns=["diff", "is_start", "is_end", "is_ev"]))

        ID               Timestamp  Value  Status ev_number
103177  64 2010-09-21 23:13:21.090   21.5     1.0         0
252019  64 2010-09-22 00:44:14.890   21.5     1.0         0
271381  64 2010-09-22 00:44:15.890   21.5     0.0         0
259875  64 2010-09-22 00:44:18.440   23.0     1.0         2
18870   64 2010-09-22 00:44:19.890   24.5     1.0         2
205910  64 2010-09-22 00:44:23.440   24.5     0.0         2
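
Since every event row now carries its ev_number, a per-event aggregate such as the median mentioned in the question reduces to a groupby. A sketch, re-creating df_records with the sample values above:

```python
import pandas as pd

# df_records as extracted above: event rows labeled by ev_number
df_records = pd.DataFrame({
    "ev_number": [0, 0, 0, 2, 2, 2],
    "Value": [21.5, 21.5, 21.5, 23.0, 24.5, 24.5],
})

# one median per event
medians = df_records.groupby("ev_number")["Value"].median()
```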

Further steps

1. List of Intervals (StartTime, EndTime)

df_out = df_start_end.loc[df_start_end["is_start"], ["Timestamp", "ev_number"]]\
    .merge(df_start_end.loc[df_start_end["is_end"], ["ID", "Timestamp", "ev_number"]], how="outer", on="ev_number")\
    .rename(columns={"Timestamp_x": "StartTime", "Timestamp_y": "EndTime"})\
    .sort_values("EndTime")\
    .reset_index(drop=True)

# add the first record as first interval StartTime when needed
if modulus == 0:
    df_out["StartTime"].iat[0] = df["Timestamp"].iat[0]

print(df_out[["ID","StartTime","EndTime","ev_number"]])
   ID               StartTime                 EndTime  ev_number
0  64 2010-09-21 23:13:21.090 2010-09-22 00:44:15.890        0.0
1  64 2010-09-22 00:44:18.440 2010-09-22 00:44:23.440        2.0

2. Append (StartTime, EndTime) to the right of df

This can be easily done by matching ev_number.

df_with_interval = df\
    .merge(df_start_end.loc[df_start_end["is_start"], ["Timestamp", "ev_number"]], on="ev_number", suffixes=("", "_1"))\
    .merge(df_start_end.loc[df_start_end["is_end"], ["Timestamp", "ev_number"]], on="ev_number", suffixes=("", "_2"))\
    .rename(columns={"Timestamp_1": "StartTime", "Timestamp_2": "EndTime"})\
    .sort_values("Timestamp")

print(df_with_interval)
   ID               Timestamp  ...               StartTime                 EndTime
0  64 2010-09-22 00:44:18.440  ... 2010-09-22 00:44:18.440 2010-09-22 00:44:23.440
1  64 2010-09-22 00:44:19.890  ... 2010-09-22 00:44:18.440 2010-09-22 00:44:23.440
2  64 2010-09-22 00:44:23.440  ... 2010-09-22 00:44:18.440 2010-09-22 00:44:23.440

However, I will not go further because I don't know the real use case of the data. I believe the central problem has been solved up to this point.

N.B. This is an example of why you shouldn't prescribe the means of solution when you ask questions.

Bill Huang
  • Thank you for the solution. Could you please explain `# output 2` in more details? P.S. I think the merge didn't work as intended as it had created repeated columns. – nilsinelabore Oct 26 '20 at 05:18
  • What’s the purpose of these two steps: ‘modulus = 0 if df["Status"].iat[0] == 1 else 1 df_records = df[df["ev_number"] % 2 == modulus]’? – nilsinelabore Oct 26 '20 at 21:01
  • Would I be able to groupby some columns, like `df["ev_number"] = df.groupby(['A', 'B'])["is_ev"].diff().cumsum()`? But I got the same consecutive `ev_number` – nilsinelabore Oct 26 '20 at 22:43
  • 1
    That is another question based on a different dataset. You may post another question directly. Also note that SO is a Q&A site, not a discussion forum, so comments unrelated to the original question or this answer are discouraged. If you have another good follow-up question, please just ask separately. This is not uncommon on SO. N.B. I deleted my other comments because they were now integrated into the post. You may also do so to reduce distraction for other potential readers. – Bill Huang Oct 26 '20 at 23:44
  • Why don't the `True` and `False` values of `is_ev` column offset each other in `cumsum`? – nilsinelabore Oct 28 '20 at 05:23