Need to add date ranges between two date columns in pyspark?

Question

I have input pyspark dataframe with columns like ID,StartDatetime,EndDatetime. I want to add new column named newdate based on startdatetime and enddatetime.

Input DF :-

ID  StartDatetime       EndDatetime
1   21-06-2021 07:00    24-06-2021 16:00
2   21-06-2021 07:00    22-06-2021 16:00

required output :-

ID  StartDatetime       EndDatetime         newdate
1   21-06-2021 07:00    24-06-2021 16:00    21-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    22-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    23-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    24-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    21-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    22-06-2021

It is generally considered useful to post what attempts have been made to achieve the desired/required output already. That way, others may be able to help refine what is attempted & offer much more targeted solution/s. — jsN00b, Apr 27 '22 at 13:12
i have tried to apply given solution in this link but need the same solution in pyspark https://stackoverflow.com/questions/53782338/how-to-add-new-rows-on-date-range-in-dataframe-in-pyspark — Deepak Kumar, Apr 27 '22 at 13:16

Cena · Accepted Answer · 2022-04-27T18:43:32.270

You can use explode and array_repeat to duplicate the rows.

I use a combination of row_number and date functions to get the date ranges between start and end dates:

from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("ID").orderBy('StartDatetime')

output_df = df.withColumn("diff", 1+F.datediff(F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')), \
                                               F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))))\
              .withColumn('diff', F.expr('explode(array_repeat(diff,int(diff)))'))\
              .withColumn("diff", F.row_number().over(w))\
              .withColumn("start_dt", F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')))\
              .withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff-1)"), 'dd-MM-yyyy')).drop('diff', 'start_dt')

Output:

output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID|   StartDatetime|     EndDatetime|   newdate|
+---+----------------+----------------+----------+
|  1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+

I dropped the diff column, but displaying it will help you understand the logic if it's not clear.

Need to add date ranges between two date columns in pyspark?

1 Answers1