I have a user defined function as follows which I want to use to derive new columns in my dataframe:
import datetime

from pyspark.sql.types import DateType

def to_date_formatted(date_str, format):
    # Treat empty strings and nulls as missing dates
    if date_str == '' or date_str is None:
        return None
    try:
        dt = datetime.datetime.strptime(date_str, format)
    except ValueError:
        return None
    return dt.date()

spark.udf.register("to_date_udf", to_date_formatted, DateType())
I can use this by running SQL like select to_date_udf(my_date, '%d-%b-%y') as date. Note the ability to pass a custom format as an argument to the function.
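Concretely, I run it along these lines (my_table here is just a placeholder view name for illustration):

df.createOrReplaceTempView("my_table")
result = spark.sql("select to_date_udf(my_date, '%d-%b-%y') as date from my_table")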
However, I'm struggling to use it with the PySpark column expression syntax rather than SQL.
I want to write something like:

df.withColumn("date", to_date_udf('my_date', '%d-%b-%y'))
But this results in an error. How can I do this?
[Edit: In this specific example, in Spark 2.2+ you can provide an optional format argument to the built-in to_date function. I'm on Spark 2.0 at the moment, so this is not possible for me. Also worth noting that I provided this as an example, but I'm interested in the general syntax for providing arguments to UDFs rather than in the specifics of date conversion.]
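For reference, on Spark 2.2+ I believe the built-in route would look something like the sketch below (note that to_date takes a Java SimpleDateFormat pattern rather than a strptime pattern, so '%d-%b-%y' would be written as 'dd-MMM-yy'):

from pyspark.sql import functions as F

# Spark 2.2+ only: to_date accepts an optional format string
df = df.withColumn("date", F.to_date(F.col("my_date"), "dd-MMM-yy"))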