I'm working on a script that calculate time unavailability of some equipment I maintain.
I have as an input a csv file of our supervising tool (around 2M lines), containing the alarms for a month.
Problem is, it takes a huge time to process it!
Once converted as a Pandas DF, I have a df with these columns:
['date','alarm_key','pcause_id_hex','activity','model_name']
- Date : timestamp of the alarm
- Alarm_key : id of the alarm
- Pcause_id_hex : description of the alarm
- Activity : Generated / Cleared (Generated means the alarm started, and Cleared means it ended)
- Model_name : name of the equipment
The alarm_key is the same when it's generated and when it's cleared.
I want as an output a new dataframe which contains :
['station','name','start_date','end_date','duration']
- Station and Name : I get it from the model_name
- Start_date : date of the "Generated" alarm
- End_date : date of the "Cleared" alarm
- Duration : I have a function that calculates it
Below is my code:
df = pd.DataFrame([
['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
['01/03/2022 00:02','5693352','CONNECTION KO','Generated','Equip3_Station3'],
['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
],
columns=['date','alarm_key','pcause_id_hex','activity','model_name']
)
list_alarms = [{}]
for i, row in df.iterrows():
# Process row information
row_info = {
'date': row['date'],
'alarm_key': row['alarm_key'],
'pcause_id_hex': row['pcause_id_hex'],
'activity': row['activity'],
'model_name': row['model_name'],
}
# Check if it's a generated alarm
if row_info['activity'] == 'Generated':
alarm_info = {
'station': '',
'name': '',
'start_date': '',
'end_date': '',
'duration': 0
}
# Fill name / station info
if re.search('_', row_info['model_name']):
alarm_info['name'] = row_info['model_name'].split('_', 1)[
0]
alarm_info['station'] = row_info['model_name'].split('_', 1)[
1]
else:
alarm_info['name'] = ''
alarm_info['station'] = row_info['model_name']
# Fill start date
alarm_info['start_date'] = row_info['date']
start_datetime = datetime.strptime(
row_info['date'], '%d/%m/%Y %H:%M')
# Search for next iteration of the alarm key
row_cleared = df.loc[(df['alarm_key'] == row_info['alarm_key']) & (
df['date'] > row_info['date'])]
if not row_cleared.empty:
# If found, get end date
end_date = row_cleared.iloc[0, 0]
alarm_info['end_date'] = end_date
end_datetime = datetime.strptime(
end_date, '%d/%m/%Y %H:%M')
else:
# If not found, set end date to last day of the month
end_datetime = start_datetime.replace(day=monthrange(
start_datetime.year, start_datetime.month)[1])
alarm_info['end_date'] = end_datetime.strftime(
'%d/%m/%Y %H:%M')
# Calculate duration of the alarm
alarm_info['duration'] = _get_unavailability_time(
start_datetime, end_datetime)
list_alarms.append(alarm_info)
list_alarms.pop(0)
df_output = pd.DataFrame(list_alarms)
For the example set in the code above, I would like a result like this one :
station name start_date end_date duration
0 Station1 Equip1 01/03/2022 00:01 01/03/2022 02:02 0.983333
1 Station3 Equip3 01/03/2022 00:02 31/03/2022 00:02 600.000000
I iterate through the dataframe, getting the row info. If it's a Generated one, I look for the next iteration of the alarm key with a Cleared activity. Once done, I store the end date in a list containing the information related to the alarm. (If an alarm isn't cleared, I set the end date as the last day of the month)
I don't know how to speed it up way more. (as you may see, I'm absolutely not an expert in this)
If you have some suggestions to improve the process, please let me know!