I have a Dataset like this:
+----+------+
|code|status|
+----+------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+
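In case it helps, here is a sketch of how this Dataset could be built (assuming a SparkSession named spark; my real data comes from elsewhere):

import spark.implicits._

val df = Seq(
  (1, "new"),
  (2, null),
  (3, null)
).toDF("code", "status")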
I would like to write a UDF that depends on both columns.
I got it working following the second approach in this answer, which is to handle null outside the UDF and write myFn to take a Boolean as the second parameter:
df.withColumn("new_column",
when(pst_regs("status").isNull,
myFnUdf($"code", lit(false))
)
.otherwise(
myFnUdf($"code", lit(true))
)
)
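For context, myFn in that working version takes the Boolean and looks roughly like this (a sketch; the name hasStatus and the body are placeholders for my real logic):

import org.apache.spark.sql.functions.{lit, udf, when}

def myFn(code: Int, hasStatus: Boolean): String =
  if (hasStatus) s"${code}_with_status" else s"${code}_no_status"

val myFnUdf = udf(myFn _)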
To handle null inside the UDF instead, the approach I looked at is from this answer, which talks about "wrapping arguments with Options". I tried code like this:
df.withColumn("new_column", myFnUdf($"code", $"status"))
def myFn(code: Int, status: String) = (code, Option(status)) match {
  case (1, "new")   => "1_with_new_status"
  case (2, Some(_)) => "2_with_any_status"
  case (3, None)    => "3_no_status"
}
But a row with null gives type mismatch; found: None.type, required: String. I also tried wrapping an argument with Option during udf creation, without success. The basic form of this (without Option) looks like this:
val myFnUdf = udf[String, Int, String](myFn(_: Int, _: String))
I'm new to Scala, so I'm sure I'm missing something simple. Part of my confusion may be the different syntaxes for creating UDFs from functions (e.g., per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html), so I'm not sure I'm using the best approach. Any help appreciated!
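For reference, the two udf-creation styles I've seen boil down to something like this (again using the Boolean version of myFn as the example):

// style 1: pass a function literal and let Scala infer the types
val myFnUdf = udf((code: Int, hasStatus: Boolean) => myFn(code, hasStatus))

// style 2: spell out the return type and argument types explicitly
val myFnUdf2 = udf[String, Int, Boolean](myFn)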
EDIT
Edited to add the missing (1, "new") case, per @user6910411's and @sgvd's comments.