I have a Dataset like this:
+----+------+
|code|status|
+----+------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+
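In case it helps, here is a sketch of how this Dataset could be built (assuming a SparkSession named spark; my real data comes from elsewhere):

import spark.implicits._

val df = Seq(
  (1, "new"),
  (2, null),
  (3, null)
).toDF("code", "status")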
I would like to write a UDF that depends on both columns.
I got it working following the second approach in this answer, which is to handle null outside the UDF and write myFn to take a Boolean as the second parameter:
df.withColumn("new_column",
when(pst_regs("status").isNull,
myFnUdf($"code", lit(false))
)
.otherwise(
myFnUdf($"code", lit(true))
)
)
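For context, myFn in that working version takes the Boolean and looks roughly like this (a sketch; the name hasStatus and the body are placeholders for my real logic):

import org.apache.spark.sql.functions.{lit, udf, when}

def myFn(code: Int, hasStatus: Boolean): String =
  if (hasStatus) s"${code}_with_status" else s"${code}_no_status"

val myFnUdf = udf(myFn _)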
To handle null inside the UDF instead, the approach I looked at is from this answer, which talks about "wrapping arguments with Options". I tried code like this:
df.withColumn("new_column", myFnUdf($"code", $"status"))
def myFn(code: Int, status: String) = (code, Option(status)) match {
  case (1, "new")   => "1_with_new_status"
  case (2, Some(_)) => "2_with_any_status"
  case (3, None)    => "3_no_status"
}
But a row with null gives type mismatch; found: None.type, required: String. I also tried wrapping an argument with Option during udf creation, without success. The basic form of this (without Option) looks like this:
val myFnUdf = udf[String, Int, String](myFn(_: Int, _: String))
I'm new to Scala, so I'm sure I'm missing something simple. Part of my confusion may be the different syntaxes for creating UDFs from functions (e.g., per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html), so I'm not sure I'm using the best approach. Any help appreciated!
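For reference, the two udf-creation styles I've seen boil down to something like this (again using the Boolean version of myFn as the example):

// style 1: pass a function literal and let Scala infer the types
val myFnUdf = udf((code: Int, hasStatus: Boolean) => myFn(code, hasStatus))

// style 2: spell out the return type and argument types explicitly
val myFnUdf2 = udf[String, Int, Boolean](myFn)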
EDIT
Edited to add the missing (1, "new") case, per @user6910411's and @sgvd's comments.