0

I have input pyspark dataframe with columns like ID,StartDatetime,EndDatetime. I want to add new column named newdate based on startdatetime and enddatetime.

Input DF :-

ID  StartDatetime       EndDatetime
1   21-06-2021 07:00    24-06-2021 16:00
2   21-06-2021 07:00    22-06-2021 16:00

required output :-

ID  StartDatetime       EndDatetime         newdate
1   21-06-2021 07:00    24-06-2021 16:00    21-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    22-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    23-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    24-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    21-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    22-06-2021
jsN00b
  • 3,584
  • 2
  • 8
  • 21
  • It is generally considered useful to post what attempts have been made to achieve the desired/required output already. That way, others may be able to help refine what is attempted & offer much more targeted solution/s. – jsN00b Apr 27 '22 at 13:12
  • i have tried to apply given solution in this link but need the same solution in pyspark https://stackoverflow.com/questions/53782338/how-to-add-new-rows-on-date-range-in-dataframe-in-pyspark – Deepak Kumar Apr 27 '22 at 13:16

1 Answers1

2

You can use explode and array_repeat to duplicate the rows.

I use a combination of row_number and date functions to get the date ranges between start and end dates:

from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("ID").orderBy('StartDatetime')

output_df = df.withColumn("diff", 1+F.datediff(F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')), \
                                               F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))))\
              .withColumn('diff', F.expr('explode(array_repeat(diff,int(diff)))'))\
              .withColumn("diff", F.row_number().over(w))\
              .withColumn("start_dt", F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')))\
              .withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff-1)"), 'dd-MM-yyyy')).drop('diff', 'start_dt')

Output:

output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID|   StartDatetime|     EndDatetime|   newdate|
+---+----------------+----------------+----------+
|  1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+

I dropped the diff column, but displaying it will help you understand the logic if it's not clear.

Cena
  • 3,316
  • 2
  • 17
  • 34