How can I iterate over these two very large alarm dataframes faster?

Question

I have two pandas DataFrames. One is alarm condition starting events, and the other is alarm condition ending events. Each event has an Alarm ID and a Message Time, and the stop DataFrame has a value for Termination Type. For each event in start, I want to add the next time the same Alarm ID registers a stop and the Termination Type for that stop event. I also want to calculate the Alarm Active Seconds between the start event and the stop event. Some start events may not have a corresponding stop event, some start events may have a simultaneous stop event (in which case the Alarm Active Seconds is zero), and more than one start event may share the same stop event.

def alarmActiveTime(start, stop):
    i = 0
    while i < len(start):
        j = stop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
        match_found = False
        while (j < len(stop)) & (match_found == False):
            if ((stop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (stop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
                start.at[i, 'Terminated Time'] = stop.iloc[j].loc['Message Time']
                start.at[i, 'Termination Type'] = stop.iloc[j].loc['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
                match_found = True
                j = len(stop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    return start

I have about 570,000 start events and 480,000 stop events, so it takes hours to run. I'm looking for ways to improve the run time of this loop.

First, I tried filtering the stop DataFrame by Alarm ID.

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
        j = thisStop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
        match_found = False # Flag variable to track if a match is found
        while (j < len(thisStop)) & (match_found == False):
            if ((thisStop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (thisStop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
                start.at[i, 'Terminated Time'] = thisStop.iloc[j].loc['Message Time']
                start.at[i, 'Termination Type'] = thisStop.iloc[j].loc['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
                match_found = True
                j = len(thisStop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    print("%s HOURS ELAPSED." % round(((time.time() - start_time)/3600), 2))
    return start

I also tried filtering the stop DataFrame by Alarm ID and Message Time, and then returning the first element if the result was larger than 0.

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
        thisStop = thisStop.loc[thisStop['Message Time'] >= start.iloc[i].loc['Message Time']]
        if (len(thisStop) > 0):
                start.at[i, 'Terminated Time'] = thisStop.iloc[0].loc['Message Time']
                start.at[i, 'Termination Type'] = thisStop.iloc[0].loc['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
        else:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    return start

These methods were both faster than searching through an unfiltered stop log, but they were roughly as fast as each other when I timed the first 20,000 loops or so. I presume that filtering by Message Time could have a greater impact as I work through the list. Is there anything else I could do to run this kind of loop faster, or is it just going to take that long to do this many comparisons?

Edit: I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I cannot post a screenshot of the alarm logs, but will post another update after I've got something masked that I can post.

Can you show us some example data, and the expected output? Almost always with pandas, the moment you start manually looping, you're shooting yourself and your performance in the foot — Pep_8_Guardiola, Jun 12 '23 at 13:40
Kindly provide a snippet of your dataframe and expected out. — Himanshu Panwar, Jun 12 '23 at 13:54
Can't you just merge the two dataframes together using the AlarmID column and maybe a condition to make sure the stop time is later than the start? like others have said a snippet of the input data and an example of the output would be good to see — Hillygoose, Jun 12 '23 at 14:24

score 0 · Answer 1 · answered Jun 12 '23 at 14:31

As @Pep_8_Guardiola said, in most cases you shouldn't iterate over pandas dataframes. You said that both have AlarmID. So my first suggestion would be to Merge both dataframes by AlarmID. By doing this, you would have in each row both the start time, stop time, and all other columns in both DataFrames.

From that it should be way easier to do the calculations you need. For example:

df['alarm_total_active_time'] = df['alarm_stop_time'] - df['alarm_start_time']

I'm not entirely sure the previous code works with the data you have, because I don't know if it is in datetime format or which DTYPE, but something similar should work. (Perhaps using the Timedelta method from Pandas.Timestamp could help)

For a more complete answer, you should provide more information on the dataframe you have. Search the pandas documentation for more information on how to handle dates and timestamps and how to perform operations inside the dataframe. As mentioned already, usually iterating over the DF is the worst solution for the problem you have.

I tried using merge, but I got an array memory error. Even if I had the memory to get the merge through, would I be able to map the same stop event to multiple start events? — Leon Carrubba, Jun 12 '23 at 19:52

score 0 · Accepted Answer · answered Jun 26 '23 at 20:36

I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I couldn't post a screenshot of the alarm logs because the data is confidential. However, I was able to find a solution that cut the run time of the function from ~7 hours to ~30 minutes. I don't expect I will have to run this function very often, so this run time is acceptable.

First I filtered the dataframe based on 'Alarm ID'. I tried merging the filtered dataframes, but that didn't generate the output I was looking for. I used apply.lambda() to generate a tuple of the variables I was looking for. Then I defined two new columns based on the tuple and concatenated the filtered dataframe to the output.

def alarmActiveTime(start, stop):
    output = pd.DataFrame()
    listIDs = start['Alarm ID'].unique().tolist()
    for x in listIDs:
        def termination(time): #This was broken out into two different functions at first, but by returning a tuple we can get both important values at once without having to run essentially the same loop twice
            thisDF = thisStop[(thisStop['Message Time']>= time)]
            timeList = thisDF['Message Time'].tolist()
            typeList = thisDF['Termination Type'].tolist()
            if len(timeList) > 0:
                timeOutput = timeList[0]
                typeOutput = typeList[0]
            else:
                timeOutput = 'Undefined'
                typeOutput = 'Undefined'
            termTuple = (timeOutput, typeOutput)
            return termTuple
        thisStart = start.loc[start['Alarm ID'] == x]
        thisStop = stop.loc[stop['Alarm ID'] == x]
        thisStart['Termination Tuple'] = thisStart.apply(lambda x: termination(x['Message Time']), axis=1)
        thisStart['Termination Time'] = thisStart['Termination Tuple'].apply(lambda x: x[0])
        thisStart['Termination Type'] = thisStart['Termination Tuple'].apply(lambda x: x[1])
        thisStart.drop('Termination Tuple', axis=1, inplace=True)
        output = pd.concat([output, thisStart])
    return output

I'm sure this could be further optimized, but this seemed like it was manageable. Thank you everyone who gave feedback. Let me know if I'm missing something.

How can I iterate over these two very large alarm dataframes faster?

2 Answers2