For Spark 2.4+ it is possible to get the number of weekdays without using numpy or a udf; the built-in SQL functions are sufficient. (Note that the extract function used below was only added in Spark 3.0; a Spark 2.4 variant based on dayofweek follows after the output.) Following roughly this answer, we can:
- create an array of dates containing all days between begin and end using sequence
- transform each day into a struct holding the day and its day-of-week value
- filter out the days that are Saturdays and Sundays
- get the size of the remaining array
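To make the example reproducible, here is a minimal input DataFrame matching the output shown further down (the column names begin and end come from the output table; the to_date casts are an assumption about how the columns arrive):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data taken from the output table below
df = (spark.createDataFrame(
    [('2020-09-19', '2020-09-20'),
     ('2020-09-21', '2020-09-24'),
     ('2020-09-21', '2020-09-25'),
     ('2020-09-21', '2020-09-26'),
     ('2020-09-21', '2020-10-02'),
     ('2020-09-19', '2020-10-03')],
    ['begin', 'end'])
    .withColumn('begin', F.to_date('begin'))
    .withColumn('end', F.to_date('end')))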
# create an array containing all days between begin and end
(df.withColumn('days', F.expr('sequence(begin, end, interval 1 day)'))
 # pair each day with its ISO day-of-week value (1 = Monday ... 7 = Sunday);
 # the pair becomes a struct with fields day and col2, so keeping only
 # col2 <= 5 (Friday) drops Saturdays and Sundays, and .day unwraps the dates again
 .withColumn('weekdays', F.expr('filter(transform(days, day -> (day, extract(dow_iso from day))), day -> day.col2 <= 5).day'))
 # count how many days are left
 .withColumn('no_of_weekdays', F.expr('size(weekdays)'))
 # drop the intermediate columns
 .select('begin', 'end', 'no_of_weekdays')
 .show(truncate=False))
Output:
+----------+----------+--------------+
|begin |end |no_of_weekdays|
+----------+----------+--------------+
|2020-09-19|2020-09-20|0 |
|2020-09-21|2020-09-24|4 |
|2020-09-21|2020-09-25|5 |
|2020-09-21|2020-09-26|5 |
|2020-09-21|2020-10-02|10 |
|2020-09-19|2020-10-03|10 |
+----------+----------+--------------+
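As noted above, extract is only available from Spark 3.0 on. On Spark 2.4 the same idea works with dayofweek (available since Spark 2.3), which numbers the days 1 = Sunday ... 7 = Saturday, so the weekdays are the values 2 to 6. A sketch of this variant:

# Spark 2.4 variant: filter directly on dayofweek instead of transform + extract
(df.withColumn('days', F.expr('sequence(begin, end, interval 1 day)'))
 # dayofweek returns 1 = Sunday ... 7 = Saturday, so 2..6 are Monday..Friday
 .withColumn('no_of_weekdays', F.expr('size(filter(days, day -> dayofweek(day) between 2 and 6))'))
 .select('begin', 'end', 'no_of_weekdays')
 .show(truncate=False))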
For Spark <= 2.3 you would have to use a udf. If numpy is a problem, a solution inspired by this answer can be used.
from datetime import timedelta
from pyspark.sql import functions as F

@F.udf('int')  # declare the return type; the default would be string
def dateDiffWeekdays(end, start):
    # iterate over all days in [start, end] and count the ones that are Mon-Fri
    daygenerator = (start + timedelta(x) for x in range((end - start).days + 1))
    return sum(1 for day in daygenerator if day.isoweekday() <= 5)

df.withColumn("no_of_weekdays", dateDiffWeekdays(df.end, df.begin)).show()
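The counting logic itself is plain Python, so it can be sanity-checked locally against rows from the table above:

from datetime import date, timedelta

def weekday_count(start, end):
    # same logic as the udf body, without Spark
    daygenerator = (start + timedelta(x) for x in range((end - start).days + 1))
    return sum(1 for day in daygenerator if day.isoweekday() <= 5)

print(weekday_count(date(2020, 9, 19), date(2020, 9, 20)))  # 0 (Saturday and Sunday only)
print(weekday_count(date(2020, 9, 21), date(2020, 9, 25)))  # 5 (Monday to Friday)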