0

The following is the sample input for the data which has Table_event_log(device_id, video_id, event_timestamp, event_type) as attributes

We need to calculate the difference of the timestamp for the device based on play/stop event and calculate the total watch time for the each video_id

Following given is example for one video

time difference between play and stop for Android Device with video_id 1 is having 1 minute as watch time.

time difference between play and stop for Apple Device with video_id 1 is having 1 minute as watch time. So, the total becomes 2 minutes for video_id 1

data1=[("Android",1,'2021-07-24 12:01:19.000',"play"),("Android",1,'2021-07-24 12:02:19.000',"stop"),
       ("Apple",1,'2021-07-24 12:03:19.000',"play"),("Apple",1,'2021-07-24 12:04:19.000',"stop"),]

schema1=StructType([StructField('device_id', StringType(),True),
                    StructField('video_id',IntegerType(),True),
                    StructField('event_timestamp',StringType(),True),
                    StructField('event_type',StringType(),True)
                       ])

transaction=spark.createDataFrame(data1,schema=schema1)
transaction=transaction.withColumn("Converted_timestamp",to_timestamp("event_timestamp"))
  • https://stackoverflow.com/questions/8142364/how-to-compare-two-dates https://docs.python.org/3/library/datetime.html#timedelta-objects you can convert the strings into python datetime objects and subtract them to get the timedelta which will have the difference. This object can be converted back to string if needed. – Poiuy Jun 24 '22 at 18:10
  • 1. Which one is correct if you have _2 consecutive Stops_? 2. Which one is correct if you have _2 consecutive Starts_? 3. What's the time played if at the end you have Start, but don't have Stop? – ZygD Jun 27 '22 at 03:38

1 Answers1

0

try this:

transaction = (
    transaction
    .withColumn("Converted_timestamp", f.to_timestamp("event_timestamp"))
    .withColumn('next_event_timestamp', f.lead('Converted_timestamp').over(Window.partitionBy('device_id', 'video_id').orderBy('event_timestamp')))
    .where(f.col('event_type') == 'play')
    .withColumn('play_time_minutes', (f.col('next_event_timestamp').cast(IntegerType()) - f.col('Converted_timestamp').cast(IntegerType()))/60)
    .groupBy('video_id')
    .agg(f.sum('play_time_minutes').alias('totatl_play_time_minutes'))
)
ARCrow
  • 1,360
  • 1
  • 10
  • 26