I have a PySpark dataframe df:
---------------------------------------------------------
primaryKey | start_timestamp | end_timestamp
---------------------------------------------------------
key1 | 2020-08-13 15:40:00 | 2020-08-13 15:44:47
key2 | 2020-08-14 12:00:00 | 2020-08-14 12:01:13
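For reference, a minimal snippet that reproduces this dataframe (I am assuming the two timestamp columns are actual timestamp types, not strings):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.createDataFrame(
            [
                ("key1", "2020-08-13 15:40:00", "2020-08-13 15:44:47"),
                ("key2", "2020-08-14 12:00:00", "2020-08-14 12:01:13"),
            ],
            ["primaryKey", "start_timestamp", "end_timestamp"],
        )
        # cast the string columns to proper timestamps
        .withColumn("start_timestamp", F.to_timestamp("start_timestamp"))
        .withColumn("end_timestamp", F.to_timestamp("end_timestamp"))
    )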
I want to create a dataframe containing a time series between start_timestamp and end_timestamp for every key, split into intervals of x seconds. For example, with a gap of x = 120 seconds the output would be:
-----------------------------------------------------------
primaryKey | start_timestamp_new | end_timestamp_new
key1 | 2020-08-13 15:40:00 | 2020-08-13 15:41:59
key1 | 2020-08-13 15:42:00 | 2020-08-13 15:43:59
key1 | 2020-08-13 15:44:00 | 2020-08-13 15:45:59
key2 | 2020-08-14 12:00:00 | 2020-08-14 12:01:59
I am trying to use the approach mentioned here, but I am unable to apply it to a Spark dataframe.
Any info on how to create this would be a huge help.
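For completeness, the rough direction I have been exploring is to generate the interval starts with sequence() (which, as far as I know, supports timestamps with an interval step in Spark 2.4+) and then explode them per key. I am not confident this is the right way to apply it:

    from pyspark.sql import functions as F

    x = 120  # gap in seconds

    result = (
        df
        # one row per interval start between start_timestamp and end_timestamp
        .withColumn(
            "start_timestamp_new",
            F.explode(
                F.expr(f"sequence(start_timestamp, end_timestamp, interval {x} seconds)")
            ),
        )
        # each interval ends one second before the next interval starts
        .withColumn(
            "end_timestamp_new",
            F.col("start_timestamp_new") + F.expr(f"interval {x - 1} seconds"),
        )
        .select("primaryKey", "start_timestamp_new", "end_timestamp_new")
    )

    result.show(truncate=False)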