The sample input below comes from the table Table_event_log, which has the attributes (device_id, video_id, event_timestamp, event_type).
We need to calculate, for each device, the time difference between its play and stop events, and then sum those differences to get the total watch time for each video_id.
Here is an example for one video:
The time difference between play and stop for the Android device with video_id 1 is 1 minute of watch time.
The time difference between play and stop for the Apple device with video_id 1 is also 1 minute of watch time, so the total watch time for video_id 1 is 2 minutes.
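To make the expected result concrete, the arithmetic described above can be sketched in plain Python (not Spark). This is only an illustration, and it assumes that every play is followed by a matching stop for the same device_id and video_id; the names `events`, `total_watch_seconds` and `open_play` are made up for this sketch.

```python
from collections import defaultdict
from datetime import datetime

# Same four sample events as in this post:
# (device_id, video_id, event_timestamp, event_type)
events = [
    ("Android", 1, "2021-07-24 12:01:19.000", "play"),
    ("Android", 1, "2021-07-24 12:02:19.000", "stop"),
    ("Apple", 1, "2021-07-24 12:03:19.000", "play"),
    ("Apple", 1, "2021-07-24 12:04:19.000", "stop"),
]


def total_watch_seconds(rows):
    """Pair each play with the next stop per (device_id, video_id), then sum per video_id."""
    open_play = {}               # (device_id, video_id) -> time of the unmatched play
    totals = defaultdict(float)  # video_id -> seconds watched
    # The timestamps are fixed-width strings, so sorting them as text is chronological.
    for device_id, video_id, ts, event_type in sorted(rows, key=lambda r: r[2]):
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        key = (device_id, video_id)
        if event_type == "play":
            open_play[key] = when
        elif event_type == "stop" and key in open_play:
            # Assumption: a stop without a preceding play is simply ignored.
            totals[video_id] += (when - open_play.pop(key)).total_seconds()
    return dict(totals)


watch_seconds = total_watch_seconds(events)
print(watch_seconds)  # {1: 120.0}
```

For the sample rows this gives 120 seconds, which is the 2 minutes expected for video_id 1.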
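from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuses the active session if one already exists (e.g. in a notebook)
spark = SparkSession.builder.getOrCreate()
# Sample events: (device_id, video_id, event_timestamp, event_type)
data1 = [("Android", 1, '2021-07-24 12:01:19.000', "play"), ("Android", 1, '2021-07-24 12:02:19.000', "stop"),
         ("Apple", 1, '2021-07-24 12:03:19.000', "play"), ("Apple", 1, '2021-07-24 12:04:19.000', "stop")]
schema1 = StructType([StructField('device_id', StringType(), True),
                      StructField('video_id', IntegerType(), True),
                      StructField('event_timestamp', StringType(), True),
                      StructField('event_type', StringType(), True)
                      ])
transaction = spark.createDataFrame(data1, schema=schema1)
# Parse the string column into a timestamp so that time differences can be computed
transaction = transaction.withColumn("Converted_timestamp", to_timestamp("event_timestamp"))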