Assume we have a PySpark DataFrame `df` as shown below. How can I use PySpark to get the duration (in minutes) between each biking event and its corresponding driving event? Assume each biking event has exactly one corresponding driving event; however, within a single day there can be multiple biking-driving pairs. Finally, store the results in a DataFrame with columns such as biking_time, biking_event, driving_time, driving_event, and durations_inMins.
Note: there can be other events between the biking and the driving event; for example, a person can go from biking to running to swimming and then to driving.
One example refers to below table:
The duration for 03/01/2018 between biking and driving is: 8:12 - 5:12 = 3 hours = 180 minutes.
  | TimeDetails | Event
---|---|---
1 | 3/1/2018 5:12 | Biking |
2 | 3/1/2018 6:12 | Swimming |
3 | 3/1/2018 7:12 | Hiking |
4 | 3/1/2018 8:12 | Driving |
5 | 3/2/2018 9:12 | Biking |
6 | 3/2/2018 10:12 | Swimming |
7 | 3/2/2018 11:12 | Swimming |
8 | 3/2/2018 12:12 | Driving |
9 | 3/2/2018 13:12 | Swimming |
Below is the sample output:
  | biking_time | event_name1 | driving_time | event_name2 | durations_inMins
---|---|---|---|---|---
1 | 3/1/2018 5:12 | biking | 3/1/2018 8:12 | driving | 180 |
2 | 3/2/2018 9:12 | biking | 3/2/2018 12:12 | driving | 180 |
Below is some of my code:
biking_df = df.filter(df.Event == 'Biking')
driving_df = df.filter(df.Event == 'Driving')
Can someone please provide me with some PySpark code? Thanks a lot.