0

I have data in table as :

Timestamp,value1,value2
10/22/2020 0:01,1,0
10/22/2020 0:04,1,0
10/22/2020 0:06,1,0
10/22/2020 0:15,2,1
10/22/2020 0:20,2,1
10/22/2020 0:25,1,2
10/22/2020 0:30,1,2
10/22/2020 0:34,1,2
10/22/2020 0:35,1,2
10/22/2020 0:45,0,1
10/22/2020 0:55,0,1

I want to convert it into following structure :

Start Timestamp,End Timestamp,value1,value2
10/22/2020 0:01,10/22/2020 0:04,1,0
10/22/2020 0:04,10/22/2020 0:06,1,0
10/22/2020 0:06,10/22/2020 0:15,2,1
10/22/2020 0:15,10/22/2020 0:20,2,1
10/22/2020 0:20,10/22/2020 0:25,1,2
10/22/2020 0:25,10/22/2020 0:30,1,2
10/22/2020 0:30,10/22/2020 0:34,1,2
10/22/2020 0:34,10/22/2020 0:35,1,2
10/22/2020 0:35,10/22/2020 0:45,0,1
10/22/2020 0:45,10/22/2020 0:55,0,1

I am using window function to achieve this but i am not sure how the data should be partitioned so that I can achieve above.

Help appreciated.

Utkarsh Saraf
  • 475
  • 8
  • 31

1 Answers1

0

You can use the lead function over a window without partitions.

from pyspark.sql.window import Window
from pyspark.sql.functions import *

w=Window().orderBy("Timestamp")

df.withColumn("End Timestamp", lead("Timestamp").over(w))\
    .withColumn("value1", lead("value1").over(w))\
    .withColumn("value2", lead("value2").over(w))\
    .filter(col("End Timestamp").isNotNull())\
    .withColumnRenamed("Timestamp", "Start Timestamp")\
    .select("Start Timestamp", "End Timestamp", "value1", "value2").show()

+---------------+---------------+------+------+
|Start Timestamp|  End Timestamp|value1|value2|
+---------------+---------------+------+------+
|10/22/2020 0:01|10/22/2020 0:04|     1|     0|
|10/22/2020 0:04|10/22/2020 0:06|     1|     0|
|10/22/2020 0:06|10/22/2020 0:15|     2|     1|
|10/22/2020 0:15|10/22/2020 0:20|     2|     1|
|10/22/2020 0:20|10/22/2020 0:25|     1|     2|
|10/22/2020 0:25|10/22/2020 0:30|     1|     2|
|10/22/2020 0:30|10/22/2020 0:34|     1|     2|
|10/22/2020 0:34|10/22/2020 0:35|     1|     2|
|10/22/2020 0:35|10/22/2020 0:45|     0|     1|
|10/22/2020 0:45|10/22/2020 0:55|     0|     1|
+---------------+---------------+------+------+

Note: Using a window without partitions can have a performance impact. You can read about it here.

Cena
  • 3,316
  • 2
  • 17
  • 34