I'm designing a function to return a summary of a 500k rows dataframe. The function consists in returning a grouped top 5 total amount of time in a month period and the the daily amount in the respective consecutive columns.
The original DataFrame is something like this:
| Index | DepartmentId | Duration | EventDescription | Date |
| ----- | ------------ | -------- | -----------------|--------------------|
| 0 | 0 | 24 | Event 1 |2022-06-29 00:00:00 |
| 1 | 1 | 36 | Event 2 |2022-06-08 00:00:00 |
| ... | ... | ... | ... |... |
The the final result should be something like this:
| Index | DepartmentId | EventDescription | Total_Duration | 2022-06-01 |2022-06-02... |
| ----- | ------------ | ---------------- | ---------------|------------|--------------|
| 0 | 1 | 1st Event | 271.2 | 0.52 |0.10... |
| 0 | 1 | 2nd Event | 200.9 | 0.85 |0.52... |
| 0 | 1 | 3rd Event | 154.2 | 0.24 |0.52... |
| 0 | 1 | 4th Event | 85.2 | 1.25 |0.52... |
| 0 | 1 | 5th Event | 24.5 | 2.52 |0.52... |
| 0 | 2 | 1st Event | 290.2 | 0.57 |........ |
I was able to do it using a for loop, but I'm pretty sure this is not the most efficient way to complete this task.
def sum_in_hours(data):
return data.sum()/3600
def main_stops_reasons(events_data,month, type_indicator):
# Calculate filters for the DataFrame
status = events_data['Status'] == 0
type_indicator = events_data['TypeIndicatorId'] == type_indicator
type_event = events_data['TypeEventId'].isin([3,4])
month_check = events_data['Month'] == month
hidden = events_data['Hidden'] == 0
filters = status & type_indicator & type_event & month_check & hidden
# Calculate the first part of the problem: The top 5 sum of columns Duration per DepartmentId
answer = events_data[filters].groupby(['DepartmentId','EventDescription']).agg({'Duration':sum_in_hours}).sort_values(['DepartmentId','Duration'],ascending=[True,False])
answer = answer.groupby('DepartmentId').head(5)
answer = answer.reset_index()
# Calculate the daily sum of Duration per Event and per DepartmentId
daily = events_data[filters].groupby(['DepartmentId', 'Date','EventDescription']).agg({'Duration':sum_in_hours}).sort_values(['DepartmentId','EventDescription','Date'],ascending=[True,True,True])
daily = daily.reset_index()
# Find the unique day values and create one column for each day found according to the filters
columns = daily['Date'].sort_values().unique()
answer[columns] = np.nan
# Fill every daily value by iterating row by row and column by column
for c in columns:
for r in range(len(answer)):
department_filter = daily['DepartmentId'] == answer.iloc[r][0]
event_filter = daily['EventDescription'] == answer.iloc[r][1]
daily_filter = department_filter & event_filter
answer[c][r] = daily[daily_filter][daily[daily_filter]['Date'] == c]['Duration']
return answer
Please, could someone help me to find a better solutions or another approach to this problem?