
I have a dataframe df with the columns ("id", "current_date", "days"), and I am trying to add "days" to "current_date" and create a new dataframe with a new column called "new_date" using the Spark Scala function date_add().

val newDF = df.withColumn("new_Date", date_add(df("current_date"), df("days").cast("Int")))

But it looks like the function date_add only accepts Int values and not columns. How can I get the desired output in such a case? Are there any alternative functions I can use?

Spark version: 1.6.0, Scala version: 2.10.6

Ramesh Maharjan
qubiter

2 Answers


No need to use a UDF; you can do it with an SQL expression:

val newDF = df.withColumn("new_date", expr("date_add(current_date,days)"))
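For reference, a minimal end-to-end sketch of this approach (assuming the same column names as in the question; requires an active SQLContext/SparkSession for `toDF`):

```scala
import org.apache.spark.sql.functions.expr

val df = Seq((1, "2017-01-01", 10), (2, "2017-01-01", 20))
  .toDF("id", "current_date", "days")

// Inside a SQL expression, date_add accepts a column for the number of days,
// unlike the Scala date_add(Column, Int) overload.
val newDF = df.withColumn("new_date", expr("date_add(current_date, days)"))
newDF.show()
```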
Raphael Roth
  • you can even skip `expr` and directly use (in spark 2.1+) `val newDF = df.withColumn("new_date", date_add($"current_date",days))` – greg hor Feb 13 '18 at 10:56
  • I haven't found a way of making this syntax (`val newDF = df.withColumn("new_date", date_add($"current_date",days))`) work. As of today, in Spark 2.2, the only options I can make work are a UDF or `expr()` (and I prefer the UDF). See https://stackoverflow.com/questions/50222781/use-existing-column-in-date-add-spark-function?noredirect=1#comment87463581_50222781 for more info. – jamiet May 07 '18 at 22:12
  • Thanks for the solution, but when I'm trying the code you mentioned above column is generating but the values in it are null. here is piece of code. `val dt3 = dt2.withColumn("new_date", expr("date_add('Week Start',dummy)"))` could you please help me with this. – Dinesh May 11 '18 at 11:17
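Regarding the nulls in the last comment: in a SQL expression, `'Week Start'` in single quotes is a string literal, not a column reference, so date_add parses the literal text as a date, fails, and returns null. A column name containing a space must be quoted with backticks instead (a sketch, assuming columns named `Week Start` and `dummy` as in the comment):

```scala
import org.apache.spark.sql.functions.expr

// Backticks reference the column; single quotes would create the
// literal string "Week Start", which cannot be parsed as a date.
val dt3 = dt2.withColumn("new_date", expr("date_add(`Week Start`, dummy)"))
```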

A small custom UDF can be used to make this date arithmetic possible.

import org.apache.spark.sql.functions.udf
import java.util.concurrent.TimeUnit
import java.util.Date
import java.text.SimpleDateFormat    

val date_add = udf((x: String, y: Int) => {
  // parse the date string, add y days' worth of milliseconds, format back
  val sdf = new SimpleDateFormat("yyyy-MM-dd")
  val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(y))
  sdf.format(result)
})

Usage:

scala> val df = Seq((1, "2017-01-01", 10), (2, "2017-01-01", 20)).toDF("id", "current_date", "days")
df: org.apache.spark.sql.DataFrame = [id: int, current_date: string, days: int]

scala> df.withColumn("new_Date", date_add($"current_date", $"days")).show()
+---+------------+----+----------+
| id|current_date|days|  new_Date|
+---+------------+----+----------+
|  1|  2017-01-01|  10|2017-01-11|
|  2|  2017-01-01|  20|2017-01-21|
+---+------------+----+----------+
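One caveat worth noting: `sdf.parse` throws on null or malformed input. A slightly more defensive variant (a sketch under the same assumptions as the answer's UDF) returns null instead of failing the task:

```scala
import org.apache.spark.sql.functions.udf
import java.util.concurrent.TimeUnit
import java.util.Date
import java.text.SimpleDateFormat

val date_add_safe = udf((x: String, y: Int) => {
  val sdf = new SimpleDateFormat("yyyy-MM-dd")
  // return null for a null date rather than throwing a NullPointerException
  if (x == null) null
  else sdf.format(new Date(sdf.parse(x).getTime + TimeUnit.DAYS.toMillis(y)))
})
```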
rogue-one