I have a PySpark DataFrame df:

+------------+------+
|  timestamp | days |
+------------+------+
| 2019-11-21 |    5 |
| 2019-10-22 |   21 |
|        ... |  ... |
+------------+------+

I want to subtract the days from the timestamp with

import pyspark.sql.functions as F

df.withColumn("timestamp", F.date_add(F.col("timestamp"), -F.col("days")))

The expected result would be:

+------------+------+
|  timestamp | days |
+------------+------+
| 2019-11-16 |    5 |
| 2019-10-01 |   21 |
|        ... |  ... |
+------------+------+

But I only get the error TypeError: Column is not iterable

Is there a way to get this to work?

jho

1 Answer

Using a UDF was the solution.

date_add_udf = F.udf(lambda date, days: F.date_add(date, days), pyspark.sql.types.TimestampType())

And then calling it:

df.withColumn("timestamp", date_add_udf(F.col("timestamp"), -F.col("days")))
jho
  • When I try this method, I get an error like in [this post](https://stackoverflow.com/questions/53751266/attributeerror-nonetype-object-has-no-attribute-jvm-pyspark-udf). It looks like pyspark.sql.functions are not allowed inside of UDFs. Is this particular to Databricks? – Rachel Oct 19 '21 at 18:51
  • Just found [an alternate answer](https://newbedev.com/how-to-subtract-a-column-of-days-from-a-column-of-dates-in-pyspark). You use expr() instead of a UDF. – Rachel Oct 19 '21 at 18:55