I am working on the CSV file below using PySpark (on Databricks), but I am not sure how to get the total duration of the `scan` event. Assume only one scan runs at a time.
|   | timestamp | event | value |
|---|---|---|---|
| 1 | 2020-11-17_19:15:33.438102 | scan | start |
| 2 | 2020-11-17_19:18:33.433002 | scan | end |
| 3 | 2020-11-17_20:05:21.538125 | scan | start |
| 4 | 2020-11-17_20:13:08.528102 | scan | end |
| 5 | 2020-11-17_21:23:19.635104 | pending | start |
| 6 | 2020-11-17_21:33:26.572123 | pending | end |
| 7 | 2020-11-17_22:05:29.738105 | pending | start |
| ... | ... | ... | ... |
Below are some of my thoughts, written as pandas-style pseudocode:

```python
# first get the scan start times
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']

# then get the scan end times
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']

# the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values

# total duration
total_duration_ns = each_duration.sum()
```
But I am not sure how to do the same calculation in PySpark.
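Here is roughly what I imagine the equivalent looks like; it is only a sketch under my own assumptions (the string timestamps are already parsed into a `ts` column, which is what my schema question below is about, and starts and ends strictly alternate):

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# keep only the scan events
scans = df.filter(F.col("event") == "scan")

# pair every 'end' row with the 'start' row immediately before it in time;
# a window without partitionBy pulls everything onto one partition, which
# should be acceptable for small log files
w = Window.orderBy("ts")
durations = (scans
             .withColumn("prev_ts", F.lag("ts").over(w))
             .filter(F.col("value") == "end")
             # casting a timestamp to double yields seconds since the epoch,
             # including the microsecond fraction
             .withColumn("duration_s",
                         F.col("ts").cast("double") - F.col("prev_ts").cast("double")))

total = durations.agg(F.sum("duration_s").alias("total_scan_duration_s"))
total.show()
```

Is this the right direction?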
First, do we need to create a schema that pre-defines the 'timestamp' column as a timestamp type? (Assume all the columns (timestamp, event, value) arrive as strings.)
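For example, I was imagining something along these lines; the file path and the `id` name for the unnamed first column are placeholders, and I believe the format string below matches timestamps like `2020-11-17_19:15:33.438102`:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# read every column as a plain string first
schema = StructType([
    StructField("id", StringType(), True),
    StructField("timestamp", StringType(), True),
    StructField("event", StringType(), True),
    StructField("value", StringType(), True),
])

df = (spark.read
      .option("header", True)
      .schema(schema)
      .csv("/path/to/one_log.csv")
      # then convert the string column into a real timestamp
      .withColumn("ts", F.to_timestamp("timestamp", "yyyy-MM-dd_HH:mm:ss.SSSSSS")))
```

Or is it better to let Spark infer the types?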
On the other hand, assume we have many (1000+) CSV files with the same layout stored in Databricks. How can we write reusable code that processes all of them and eventually produces one table holding the total scan duration per file?
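My rough idea is to read all the files in a single pass and group by the source file; the glob path and the output table name below are made up:

```python
from pyspark.sql import functions as F

# read all log files at once; the glob path is a placeholder
df = (spark.read
      .option("header", True)
      .csv("/mnt/logs/*.csv")
      .withColumn("source_file", F.input_file_name())
      .withColumn("ts", F.to_timestamp("timestamp", "yyyy-MM-dd_HH:mm:ss.SSSSSS")))

# per file, total scan time = sum(end timestamps) - sum(start timestamps);
# this identity holds only because scans never overlap and every start
# is assumed to have a matching end
totals = (df.filter(F.col("event") == "scan")
          .groupBy("source_file")
          .agg((F.sum(F.when(F.col("value") == "end", F.col("ts").cast("double")))
                - F.sum(F.when(F.col("value") == "start", F.col("ts").cast("double"))))
               .alias("total_scan_duration_s")))

# persist the per-file totals as one table; the table name is made up
totals.write.mode("overwrite").saveAsTable("scan_durations")
```

I picked the sum(end) minus sum(start) form because it avoids pairing rows explicitly, but I am not sure it is the standard approach.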
Can someone please confirm these sketches or share better PySpark code? Thank you so much.