I have the following pandas code in Python, with a DataFrame 'df' that contains the columns 'Connectivity_Tmstp', 'sensor_id', and 'duration_seconds':
import pandas as pd

# Index by timestamp so pd.Grouper can bucket the rows by hour
df.set_index('Connectivity_Tmstp', inplace=True)
grouper_hour = df.groupby([pd.Grouper(freq='H'), 'sensor_id'])
result_stop_hour = grouper_hour['duration_seconds'].sum()
kpi_df = result_stop_hour.to_frame()
This code sets the column 'Connectivity_Tmstp' as the index, then groups by hour and by sensor_id. Finally, it sums 'duration_seconds' within each group and puts the result in a new DataFrame like this:
Connectivity_Tmstp | sensor_id | duration_seconds
2018-10-14 07:00:00 | 70b3d5e75e003fb7 | 60
| 70b3d5e75e004348 | 40
| 70b3d5e75e00435e | 20
2018-11-02 07:00:00 | 70b3d5e75e0043b3 | 80
| 70b3d5e75e0043d7 | 10
| 70b3d5e75e0043da | 60
2019-07-18 12:00:00 | 70b3d5e75e003fb8 | 40
| 70b3d5e75e00431c | 10
| 70b3d5e75e0043c1 | 20
| 70b3d5e75e0043da | 30
Do you know how to do the same thing in PySpark?
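For reference, here is a minimal sketch of what I imagine the equivalent might look like. I am assuming df is already loaded as a Spark DataFrame with the same three columns, and that date_trunc('hour', ...) is an acceptable stand-in for pd.Grouper(freq='H') (I have not verified that this is the idiomatic approach):

from pyspark.sql import functions as F

# Assumption: df is a Spark DataFrame with 'Connectivity_Tmstp' (timestamp),
# 'sensor_id', and 'duration_seconds' columns. Spark DataFrames have no row
# index, so there is no set_index step; the hour bucket is computed directly
# as a grouping expression instead.
kpi_df = (
    df.groupBy(
        # Truncate each timestamp to the start of its hour,
        # mimicking pd.Grouper(freq='H')
        F.date_trunc('hour', 'Connectivity_Tmstp').alias('Connectivity_Tmstp'),
        'sensor_id',
    )
    .agg(F.sum('duration_seconds').alias('duration_seconds'))
    .orderBy('Connectivity_Tmstp', 'sensor_id')
)
kpi_df.show(truncate=False)

Does that look right, or is there a more idiomatic way (e.g. with F.window)?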
Thanks for your answer.
Regards, Fab