
In PySpark, suppose we have three columns: start_date, duration, and end_date. How can I look at the first row's end_date and the second row's start_date? If the second row's start_date is greater than the first row's end_date, do nothing; otherwise, if the first row's end_date is greater than the second row's start_date, replace the second row's start_date with the first row's end_date, add the second row's duration to that new start_date, and replace the second row's end_date with the result. This should be done for each complete group of ID.
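A minimal, hypothetical version of such a dataframe (column names taken from the description above; the values are purely illustrative) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: the second row of id 1 starts before the first row's end_date
df = spark.createDataFrame(
    [
        (1, "2020-01-01", 5, "2020-01-06"),
        (1, "2020-01-04", 3, "2020-01-07"),
        (1, "2020-01-10", 2, "2020-01-12"),
    ],
    ["id", "start_date", "duration", "end_date"],
)
```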

Milad Bahmanabadi
  • It would help others answer your question if you could provide a reproducible example of your dataframe and the required output. – murtihash Apr 25 '20 at 15:48
  • @MohammadMurtazaHashmi - True, but since I am new to Stack Overflow, attaching an image is not allowed for me as of now. I tried attaching an image now; see if you can see it in my post. – pallav kumar Apr 25 '20 at 16:04
  • [Please do not post images of code/data as they can't be copied](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). It would help if you create a reproducible example; take a look at [How to make good reproducible Apache Spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – anky Apr 25 '20 at 16:35

1 Answer


Use the window functions lag/lead, with a window partitioned by id and ordered by start_date, to compare the previous row's end_date with the current row's start_date.
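A minimal sketch of this approach, assuming the column names from the question (id, start_date, duration, end_date), that the date columns are ISO-formatted strings, and that duration is a number of days:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window per id, ordered by start_date, so lag() reads the previous row of the same group
w = Window.partitionBy("id").orderBy("start_date")

result = (
    df
    # assumption: the date columns start out as strings and need casting
    .withColumn("start_date", F.to_date("start_date"))
    .withColumn("end_date", F.to_date("end_date"))
    .withColumn("prev_end", F.lag("end_date").over(w))
    # overlap: the previous row ends after this row starts -> shift start_date forward
    .withColumn(
        "start_date",
        F.when(F.col("prev_end") > F.col("start_date"), F.col("prev_end"))
         .otherwise(F.col("start_date")),
    )
    # recompute end_date from the (possibly shifted) start_date plus duration in days
    .withColumn("end_date", F.expr("date_add(start_date, duration)"))
    .drop("prev_end")
)
```

One caveat: lag() only sees the previous row's original end_date, so if an adjusted end_date should in turn push the row after it, a single pass like this will not cascade the change down the group; that sequential dependency is what the comment below is asking about.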

notNull
  • Can I look at both rows at once? Using the lag function I know I can define a new column, but here I want to update the end_date sequentially, so can I do this operation in one statement? Can I write something like: when lag(end_date) over the window is greater than start_date, then start_date = lag(end_date) over the window and end_date = the updated start_date + duration, else do nothing? – pallav kumar Apr 25 '20 at 16:24