
In PySpark, suppose we have three columns: start_date, duration, and end_date. How can I look at the first row's end_date and the second row's start_date? If the second row's start_date is greater than the first row's end_date, do nothing; otherwise, if the first row's end_date is greater than the second row's start_date, replace the second row's start_date with the first row's end_date, add the second row's duration to that new start_date, and replace the second row's end_date with the result. This should be done for each complete group of ID.
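A minimal, hypothetical version of such a dataframe (column names taken from the description above; the values are purely illustrative) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: the second row of id 1 starts before the first row's end_date
df = spark.createDataFrame(
    [
        (1, "2020-01-01", 5, "2020-01-06"),
        (1, "2020-01-04", 3, "2020-01-07"),
        (1, "2020-01-10", 2, "2020-01-12"),
    ],
    ["id", "start_date", "duration", "end_date"],
)
```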

Milad Bahmanabadi
  • It would help others answer your question if you could provide a reproducible example of your dataframe and the required output. – murtihash Apr 25 '20 at 15:48
  • @MohammadMurtazaHashmi - True, but since I am new to Stack Overflow, attaching an image is not allowed for me as of now. I tried attaching an image now; see if you can see it in my post. – pallav kumar Apr 25 '20 at 16:04
  • [Please do not post images of code/data as they can't be copied](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). It would help if you create a reproducible example; take a look at [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Apr 25 '20 at 16:35

1 Answer


Use the window functions lag/lead, with a window partitioned by id and ordered by start_date, to compare the previous row's end_date with the current row's start_date.
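A minimal sketch of this approach, assuming the column names from the question (id, start_date, duration, end_date), that the date columns are ISO-formatted strings, and that duration is a number of days:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window per id, ordered by start_date, so lag() reads the previous row of the same group
w = Window.partitionBy("id").orderBy("start_date")

result = (
    df
    # assumption: the date columns start out as strings and need casting
    .withColumn("start_date", F.to_date("start_date"))
    .withColumn("end_date", F.to_date("end_date"))
    .withColumn("prev_end", F.lag("end_date").over(w))
    # overlap: the previous row ends after this row starts -> shift start_date forward
    .withColumn(
        "start_date",
        F.when(F.col("prev_end") > F.col("start_date"), F.col("prev_end"))
         .otherwise(F.col("start_date")),
    )
    # recompute end_date from the (possibly shifted) start_date plus duration in days
    .withColumn("end_date", F.expr("date_add(start_date, duration)"))
    .drop("prev_end")
)
```

One caveat: lag() only sees the previous row's original end_date, so if an adjusted end_date should in turn push the row after it, a single pass like this will not cascade the change down the group; that sequential dependency is what the comment below is asking about.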

notNull
  • Can I look at both rows at once? Using the lag function I know I can define a new column, but here I want to update the end_date sequentially, so can I do this operation in one statement? Can I write something like: when lag(end_date) over the window is greater than start_date, then start_date = lag(end_date) over the window and end_date = the updated start_date + duration, else do nothing? – pallav kumar Apr 25 '20 at 16:24