I have two pandas DataFrames. One is alarm condition starting events, and the other is alarm condition ending events. Each event has an Alarm ID and a Message Time, and the stop DataFrame has a value for Termination Type. For each event in start, I want to add the next time the same Alarm ID registers a stop and the Termination Type for that stop event. I also want to calculate the Alarm Active Seconds between the start event and the stop event. Some start events may not have a corresponding stop event, some start events may have a simultaneous stop event (in which case the Alarm Active Seconds is zero), and more than one start event may share the same stop event.
def alarmActiveTime(start, stop):
i = 0
while i < len(start):
j = stop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
match_found = False
while (j < len(stop)) & (match_found == False):
if ((stop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (stop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
start.at[i, 'Terminated Time'] = stop.iloc[j].loc['Message Time']
start.at[i, 'Termination Type'] = stop.iloc[j].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
match_found = True
j = len(stop)
else:
j += 1
if not match_found:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
return start
I have about 570,000 start events and 480,000 stop events, so it takes hours to run. I'm looking for ways to improve the run time of this loop.
First, I tried filtering the stop DataFrame by Alarm ID.
def alarmActiveTime(start, stop):
print('CALCULATING ACTIVE TIME OF ALARM...')
i = 0
while i < len(start):
thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
j = thisStop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
match_found = False # Flag variable to track if a match is found
while (j < len(thisStop)) & (match_found == False):
if ((thisStop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (thisStop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
start.at[i, 'Terminated Time'] = thisStop.iloc[j].loc['Message Time']
start.at[i, 'Termination Type'] = thisStop.iloc[j].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
match_found = True
j = len(thisStop)
else:
j += 1
if not match_found:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
print("%s HOURS ELAPSED." % round(((time.time() - start_time)/3600), 2))
return start
I also tried filtering the stop DataFrame by Alarm ID and Message Time, and then returning the first element if the result was larger than 0.
def alarmActiveTime(start, stop):
print('CALCULATING ACTIVE TIME OF ALARM...')
i = 0
while i < len(start):
thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
thisStop = thisStop.loc[thisStop['Message Time'] >= start.iloc[i].loc['Message Time']]
if (len(thisStop) > 0):
start.at[i, 'Terminated Time'] = thisStop.iloc[0].loc['Message Time']
start.at[i, 'Termination Type'] = thisStop.iloc[0].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
else:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
return start
These methods were both faster than searching through an unfiltered stop log, but they were roughly as fast as each other when I timed the first 20,000 loops or so. I presume that filtering by Message Time could have a greater impact as I work through the list. Is there anything else I could do to run this kind of loop faster, or is it just going to take that long to do this many comparisons?
Edit: I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I cannot post a screenshot of the alarm logs, but will post another update after I've got something masked that I can post.