Pandas - How to speed up a process on a huge dataframe

Question

I'm working on a script that calculate time unavailability of some equipment I maintain.

I have as an input a csv file of our supervising tool (around 2M lines), containing the alarms for a month.

Problem is, it takes a huge time to process it!

Once converted as a Pandas DF, I have a df with these columns:

['date','alarm_key','pcause_id_hex','activity','model_name']

Date : timestamp of the alarm
Alarm_key : id of the alarm
Pcause_id_hex : description of the alarm
Activity : Generated / Cleared (Generated means the alarm started, and Cleared means it ended)
Model_name : name of the equipment

The alarm_key is the same when it's generated and when it's cleared.

I want as an output a new dataframe which contains :

['station','name','start_date','end_date','duration']

Station and Name : I get it from the model_name
Start_date : date of the "Generated" alarm
End_date : date of the "Cleared" alarm
Duration : I have a function that calculates it

Below is my code:

df = pd.DataFrame([
        ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
        ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
        ['01/03/2022 00:02','5693352','CONNECTION KO','Generated','Equip3_Station3'],
        ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
        columns=['date','alarm_key','pcause_id_hex','activity','model_name']
    )

list_alarms = [{}]

for i, row in df.iterrows():
    # Process row information
    row_info = {
        'date': row['date'],
        'alarm_key': row['alarm_key'],
        'pcause_id_hex': row['pcause_id_hex'],
        'activity': row['activity'],
        'model_name': row['model_name'],
    }
    # Check if it's a generated alarm
    if row_info['activity'] == 'Generated':
        alarm_info = {
            'station': '',
            'name': '',
            'start_date': '',
            'end_date': '',
            'duration': 0
        }
        # Fill name / station info
        if re.search('_', row_info['model_name']):
            alarm_info['name'] = row_info['model_name'].split('_', 1)[
                0]
            alarm_info['station'] = row_info['model_name'].split('_', 1)[
                1]
        else:
            alarm_info['name'] = ''
            alarm_info['station'] = row_info['model_name']

        # Fill start date
        alarm_info['start_date'] = row_info['date']
        start_datetime = datetime.strptime(
            row_info['date'], '%d/%m/%Y %H:%M')

        # Search for next iteration of the alarm key
        row_cleared = df.loc[(df['alarm_key'] == row_info['alarm_key']) & (
            df['date'] > row_info['date'])]
        if not row_cleared.empty:
            # If found, get end date
            end_date = row_cleared.iloc[0, 0]
            alarm_info['end_date'] = end_date
            end_datetime = datetime.strptime(
                end_date, '%d/%m/%Y %H:%M')
        else:
            # If not found, set end date to last day of the month
            end_datetime = start_datetime.replace(day=monthrange(
                start_datetime.year, start_datetime.month)[1])
            alarm_info['end_date'] = end_datetime.strftime(
                '%d/%m/%Y %H:%M')
        # Calculate duration of the alarm
        alarm_info['duration'] = _get_unavailability_time(
            start_datetime, end_datetime)
        list_alarms.append(alarm_info)

list_alarms.pop(0)
df_output = pd.DataFrame(list_alarms)

For the example set in the code above, I would like a result like this one :

    station    name        start_date          end_date    duration
0  Station1  Equip1  01/03/2022 00:01  01/03/2022 02:02    0.983333
1  Station3  Equip3  01/03/2022 00:02  31/03/2022 00:02  600.000000

I iterate through the dataframe, getting the row info. If it's a Generated one, I look for the next iteration of the alarm key with a Cleared activity. Once done, I store the end date in a list containing the information related to the alarm. (If an alarm isn't cleared, I set the end date as the last day of the month)

I don't know how to speed it up way more. (as you may see, I'm absolutely not an expert in this)

If you have some suggestions to improve the process, please let me know!

Welcome to StackOverflow. In order to get a better response to your question, try adding a sample of your data, a clear description of what you want to accomplish, and a sample of your expected output in addition to the sample code you have already provided. — Chris, Jun 20 '22 at 14:54
Also it looks like you might be trying to create a pivot table.https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe — Chris, Jun 20 '22 at 14:57
Please provide some sample data. See: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — BrokenBenchmark, Jun 20 '22 at 15:23
I followed your recommandations and links you gave me. You now have sample data and output visualization. — Gaellone, Jun 22 '22 at 15:50
Your sample output does not go with your sample input - for example, in your input, there is a *station 3* which is not present in your output. I also don't see `31/03/2022` anywhere in the input. — Mortz, Jun 22 '22 at 17:22
Sorry, changed by mistakes one input and did not run it with newer data. It's corrected. But the 31/03/2022 is normal, I consider an alarm not cleared as one that lasted until the end of the month — Gaellone, Jun 22 '22 at 19:48

score 0 · Accepted Answer · answered Jun 22 '22 at 17:37

IIUC, you want to split the dataset into a "generated" part and a "cleared" part -

df = pd.DataFrame([
        ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
                ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
                        ['01/03/2022 00:02','5693334','CONNECTION KO','Generated','Equip2_Station2'],
                                ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
                                    ],
                                            columns=['date','alarm_key','pcause_id_hex','activity','model_name']
                                                )

df_gen = df[df['activity'] == 'Generated']
df_clr = df[df['activity'] == 'Cleared']
df_gen = df_gen.merge(df_clr[['date', 'alarm_key']], on=['alarm_key'], how='inner')
df_gen[['equipment', 'station']] = df_gen['model_name'].str.split('_', expand=True)

Output

# df_gen
             date_x alarm_key  pcause_id_hex   activity       model_name            date_y equipment   station
0  01/03/2022 00:01   5693392  CONNECTION KO  Generated  Equip1_Station1  01/03/2022 02:02    Equip1  Station1
1  01/03/2022 00:02   5693334  CONNECTION KO  Generated  Equip2_Station2  01/03/2022 00:02    Equip2  Station2

This looks like a good approach yes. I'll try something like that ! I just have to implement the case when I don't have an end for the alarm and it should work. Thanks ! — Gaellone, Jun 22 '22 at 20:03
As we are doing an `inner` join here, the output will only contain entries for those "generated" cases that have a corresponding "cleared" as well. — Mortz, Jun 23 '22 at 06:56
Yes I used your example to implement my solution. I'm using a left join instead so I get all the generated one, even the one not cleared. Thanks for your help — Gaellone, Jun 24 '22 at 09:04

Pandas - How to speed up a process on a huge dataframe

1 Answers1