First of all, let's create some random data:
import datetime
import random

import numpy as np
import pandas as pd

# five start timestamps one day apart, each paired with an end timestamp
# between 3 and 4 days later
sdate = [datetime.datetime.now() + datetime.timedelta(days=i) for i in range(5)]
edate = [date + datetime.timedelta(days=random.random() + 3) for date in sdate]

data = {
    'sdate': sdate,
    'edate': edate
}

pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
df.show()
+--------------------+--------------------+
| edate| sdate|
+--------------------+--------------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...|
|2019-12-07 19:42:...|2019-12-04 08:14:...|
|2019-12-08 21:26:...|2019-12-05 08:14:...|
|2019-12-09 18:57:...|2019-12-06 08:14:...|
|2019-12-11 04:08:...|2019-12-07 08:14:...|
+--------------------+--------------------+
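Before writing the UDF, it helps to check the schema Spark inferred, because that determines what Python objects the UDF will receive (timestamp columns arrive as datetime.datetime values):

df.printSchema()
# root
#  |-- edate: timestamp (nullable = true)
#  |-- sdate: timestamp (nullable = true)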
You can't use a plain Python function directly to compute another column in PySpark; you have to wrap it in a UDF.
NOTE: Remember to cast the result of the computation to a plain Python int, because np.busday_count returns a NumPy integer and Spark can fail when pickling NumPy types.
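A quick check outside Spark shows the NumPy type (numpy.int64 on most 64-bit builds):

>>> import numpy as np
>>> type(np.busday_count('2019-12-03', '2019-12-06'))
<class 'numpy.int64'>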
import pyspark.sql.types as T
import pyspark.sql.functions as F

@F.udf(returnType=T.IntegerType())
def get_hours2(sdate, edate):
    # np.busday_count expects day-precision dates, so drop the time part first;
    # cast the NumPy integer to a plain Python int so Spark can serialize it
    biz_days = np.busday_count(sdate.date(), edate.date())
    return int(biz_days)
Finally, we can apply the UDF to the DataFrame. Note that no F.lit wrapper is needed; the UDF call already returns a Column:
df = df.withColumn('days_outstanding', get_hours2('sdate', 'edate'))
df.show()
+--------------------+--------------------+----------------+
| edate| sdate|days_outstanding|
+--------------------+--------------------+----------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...| 3|
|2019-12-07 19:42:...|2019-12-04 08:14:...| 3|
|2019-12-08 21:26:...|2019-12-05 08:14:...| 2|
|2019-12-09 18:57:...|2019-12-06 08:14:...| 1|
|2019-12-11 04:08:...|2019-12-07 08:14:...| 2|
+--------------------+--------------------+----------------+
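If performance matters, a vectorized pandas UDF avoids the per-row Python call overhead. This is a minimal sketch, assuming Spark 3.0+ with PyArrow installed; busday_count_udf is just an illustrative name:

import pandas as pd
import numpy as np
import pyspark.sql.functions as F

@F.pandas_udf('long')
def busday_count_udf(sdate: pd.Series, edate: pd.Series) -> pd.Series:
    # truncate the timestamps to day precision (datetime64[D]),
    # which is what np.busday_count expects
    s = sdate.values.astype('datetime64[D]')
    e = edate.values.astype('datetime64[D]')
    # busday_count is vectorized: one call per batch instead of one per row
    return pd.Series(np.busday_count(s, e))

df = df.withColumn('days_outstanding', busday_count_udf('sdate', 'edate'))

The 'long' return type matches the int64 array that np.busday_count produces, so no explicit cast is needed here.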
I hope this helps you.