
I'm designing a function to return a summary of a 500k-row DataFrame. The function should return, for each department, the top 5 events by total duration over a one-month period, with the daily totals in the respective consecutive date columns.

The original DataFrame is something like this:

| Index | DepartmentId | Duration | EventDescription | Date               |
| ----- | ------------ | -------- | -----------------|--------------------|
| 0     | 0            | 24       | Event 1          |2022-06-29 00:00:00 |
| 1     | 1            | 36       | Event 2          |2022-06-08 00:00:00 |
| ...   | ...          | ...      | ...              |...                 |
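
For reference, a minimal frame of this shape can be built as follows (the values are illustrative; the extra columns are the ones my filters below rely on):

import pandas as pd

events_data = pd.DataFrame({
    'DepartmentId': [0, 1],
    'Duration': [24, 36],  # presumably seconds (see sum_in_hours below)
    'EventDescription': ['Event 1', 'Event 2'],
    'Date': pd.to_datetime(['2022-06-29', '2022-06-08']),
    # Columns used only by the filters in the function below
    'Status': [0, 0],
    'TypeIndicatorId': [1, 1],
    'TypeEventId': [3, 4],
    'Month': [6, 6],
    'Hidden': [0, 0],
})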

The final result should be something like this:

| Index | DepartmentId | EventDescription | Total_Duration | 2022-06-01 |2022-06-02... |
| ----- | ------------ | ---------------- | ---------------|------------|--------------|
| 0     | 1            | 1st Event        | 271.2          | 0.52       |0.10...       |
| 0     | 1            | 2nd Event        | 200.9          | 0.85       |0.52...       |  
| 0     | 1            | 3rd Event        | 154.2          | 0.24       |0.52...       |
| 0     | 1            | 4th Event        | 85.2           | 1.25       |0.52...       |
| 0     | 1            | 5th Event        | 24.5           | 2.52       |0.52...       |
| 0     | 2            | 1st Event        | 290.2          | 0.57       |........      |

I was able to do it using a for loop, but I'm pretty sure this is not the most efficient way to complete this task.

import numpy as np

def sum_in_hours(data):
    # Duration is stored in seconds, so convert the sum to hours
    return data.sum() / 3600

def main_stops_reasons(events_data, month, type_indicator):

    # Build the boolean filters for the DataFrame
    status = events_data['Status'] == 0
    indicator = events_data['TypeIndicatorId'] == type_indicator
    type_event = events_data['TypeEventId'].isin([3, 4])
    month_check = events_data['Month'] == month
    hidden = events_data['Hidden'] == 0
    filters = status & indicator & type_event & month_check & hidden

    # First part of the problem: the top 5 sums of Duration per DepartmentId
    answer = (events_data[filters]
              .groupby(['DepartmentId', 'EventDescription'])
              .agg({'Duration': sum_in_hours})
              .sort_values(['DepartmentId', 'Duration'], ascending=[True, False]))
    answer = answer.groupby('DepartmentId').head(5)
    answer = answer.reset_index()

    # Second part: the daily sum of Duration per EventDescription and DepartmentId
    daily = (events_data[filters]
             .groupby(['DepartmentId', 'Date', 'EventDescription'])
             .agg({'Duration': sum_in_hours})
             .sort_values(['DepartmentId', 'EventDescription', 'Date']))
    daily = daily.reset_index()

    # Find the unique day values and create one column for each day found
    columns = daily['Date'].sort_values().unique()
    answer[columns] = np.nan

    # Fill every daily value by iterating row by row and column by column
    for c in columns:
        for r in range(len(answer)):
            department_filter = daily['DepartmentId'] == answer.iloc[r, 0]
            event_filter = daily['EventDescription'] == answer.iloc[r, 1]
            day_filter = daily['Date'] == c
            match = daily.loc[department_filter & event_filter & day_filter, 'Duration']
            if len(match):
                answer.loc[r, c] = match.iloc[0]

    return answer
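
For a frame like the sample above, I call it as, e.g.:

summary = main_stops_reasons(events_data, month=6, type_indicator=1)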

Could someone help me find a better solution or another approach to this problem?


1 Answer


You can group by "DepartmentId", "EventDescription", and "Date", then unstack the Date level. That creates NaN for the date/event combinations that have no data:

df.groupby(["DepartmentId", "EventDescription", "Date"]).agg("sum").unstack(level=2).reset_index()
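
Building on that, here is a sketch of how the whole function could be vectorized, assuming the same filter columns the question uses: compute the top-5 totals per department once, pivot the daily sums once with unstack, and join the two on the (DepartmentId, EventDescription) index instead of looping:

import pandas as pd

def sum_in_hours(data):
    # Same conversion as in the question: seconds to hours
    return data.sum() / 3600

def main_stops_reasons(events_data, month, type_indicator):
    filters = ((events_data['Status'] == 0)
               & (events_data['TypeIndicatorId'] == type_indicator)
               & (events_data['TypeEventId'].isin([3, 4]))
               & (events_data['Month'] == month)
               & (events_data['Hidden'] == 0))
    filtered = events_data[filters]

    # Top 5 total Duration (in hours) per DepartmentId
    totals = (filtered
              .groupby(['DepartmentId', 'EventDescription'])['Duration']
              .agg(sum_in_hours)
              .rename('Total_Duration')
              .sort_values(ascending=False)
              .groupby('DepartmentId')
              .head(5))

    # One column per day: daily sums, then unstack the Date level
    daily = (filtered
             .groupby(['DepartmentId', 'EventDescription', 'Date'])['Duration']
             .agg(sum_in_hours)
             .unstack('Date'))

    # Index-aligned join keeps only the top-5 rows; missing days stay NaN
    out = totals.to_frame().join(daily).reset_index()
    return out.sort_values(['DepartmentId', 'Total_Duration'],
                           ascending=[True, False])

This replaces the nested loop with a single pivot and an index-aligned join, which should scale far better at 500k rows.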