I have a Spark DataFrame containing Strings, which I am mapping to numeric scores on a Likert scale. Different question IDs map to different scores. I'm trying to pattern match on a range in Scala within an Apache Spark udf, using this question as a guide:

How can I pattern match on a range in Scala?

But I'm getting a compilation error when I use a range rather than a simple OR statement, i.e.

31 | 32 | 33 | 34 works fine

31 to 35 doesn't compile. Any ideas where I'm going wrong on the syntax please?

Also, in the final case _, I'd like to map to a String rather than an Int, case _ => "None" but this gives an error: java.lang.UnsupportedOperationException: Schema for type Any is not supported

Presumably this is an issue which is generic to Spark, as it's perfectly possible to return Any in native Scala?

Here's my code:

def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {

      case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 //this is fine
      case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
      case ((31 | 32 | 33 | 34 | 35), "Often") => 2
      case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
      case ((x if 41 until 55 contains x), "None of the time") => 1 //this line won't compile
      case _ => 0 //would like to map to "None"
    })

The udf then gets used on a Spark DataFrame, as follows:

val df3 = df.withColumn("NumericScore", calculateScore(df("QuestionId"), df("AnswerText")))
LucieCBurgess
  • The idea of a UDF is that it returns a result that could be used in a SQL statement. So, it needs to be an Int, a String or some other supported type. Any doesn't make any sense in the context of SQL. Here, you're doing something essentially similar, just using dataframes rather than SQL directly. Still, if you really want different behavior in your wildcard case (and I don't understand why you would), perhaps you could return -1 or something like that. Alternatively, make the other cases return Strings. – Phasmid Jan 04 '18 at 16:52
  • @Phasmid I'm cleaning up a datafile which I'm performing analytics on. The file is currently formatted as a long list of questions which I'm pivoting the responses to. Some of the responses need to be strings, others Ints, doubles etc. I'll set the schema for each when I've pivoted the data. So I need the output of the column to be flexible - hence the use of Any. Using -1 is a good idea though. – LucieCBurgess Jan 04 '18 at 18:14
  • But I realise that use of Any isn't possible, so I'll use strings rather than ints – LucieCBurgess Jan 04 '18 at 18:24

2 Answers

The guard expression must come after the whole pattern, not inside it:

def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
  case ((31 | 32 | 33 | 34 | 35), "Often") => 2
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
  case (x, "None of the time") if 41 until 55 contains x => 1
  case _ => 0 //would like to map to "None"
})
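
The same rule can be checked in plain Scala without Spark. The following is a minimal sketch, assuming a hypothetical helper named scoreFor (not part of the original post), which also shows that an inclusive range such as 31 to 35 works once it is moved into a guard:

```scala
// Plain-Scala sketch of the scoring logic, so the match can be tried without Spark.
// `ScoreGuardSketch` and `scoreFor` are hypothetical names used for illustration only.
object ScoreGuardSketch {
  def scoreFor(questionId: Int, answerText: String): Int =
    (questionId, answerText) match {
      // A range cannot appear inside the pattern itself, but it can be tested in a guard.
      // `31 to 35` is inclusive, so it covers 31, 32, 33, 34 and 35.
      case (q, "Occasionally") if (31 to 35).contains(q) => 3
      // `41 until 55` excludes the upper bound, so it covers 41 through 54.
      case (x, "None of the time") if (41 until 55).contains(x) => 1
      case _ => 0
    }
}
```

A function like this could then be wrapped with udf(ScoreGuardSketch.scoreFor _), which keeps the matching logic testable on its own.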
Alper t. Turker

If you want the last case (case _) to map to the String "None", then all of the other cases must return a String as well.

The following udf function should work for you:

def calculateScore  = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => "4" //this is fine
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => "3"
  case ((31 | 32 | 33 | 34 | 35), "Often") => "2"
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => "1"
  case (x, "None of the time") if (x >= 41 && x < 55) => "1" //guard follows the pattern, so this compiles
  case _ => "None"
})

If you want the last case (case _) to map to None instead, then every other case must return an Option (for example Some(4)), because None is a subtype of Option.

The following code should also work for you:

def calculateScore  = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => Some(4) //this is fine
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => Some(3)
  case ((31 | 32 | 33 | 34 | 35), "Often") => Some(2)
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => Some(1)
  case (x, "None of the time") if (x >= 41 && x < 55) => Some(1) //guard follows the pattern, so this compiles
  case _ => None
})
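
The Option variant can also be tried in plain Scala. This is a minimal sketch, assuming a hypothetical helper named optionalScore; every branch has type Option[Int], so the result type is consistent. To my knowledge Spark stores None from such a udf as null in a nullable integer column, but that is worth confirming against your Spark version:

```scala
// Plain-Scala sketch of the Option-returning variant.
// `OptionalScoreSketch` and `optionalScore` are hypothetical names used for illustration only.
object OptionalScoreSketch {
  def optionalScore(questionId: Int, answerText: String): Option[Int] =
    (questionId, answerText) match {
      case (q, "Often") if (31 to 35).contains(q) => Some(2)
      case (x, "None of the time") if (x >= 41 && x < 55) => Some(1)
      // No matching rule: return None rather than a sentinel such as 0 or -1.
      case _ => None
    }
}
```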

Finally, the error message java.lang.UnsupportedOperationException: Schema for type Any is not supported states that a udf function with return type Any is not supported. When some cases return an Int and another returns a String, Scala infers their common supertype, Any, as the return type of the function. All the return types from the match cases should therefore be consistent.
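
The type inference behind that error can be seen without Spark. Below is a small sketch, assuming Scala 2 (which Spark used at the time) and two hypothetical functions, mixed and consistent, that differ only in whether their branches agree on a type:

```scala
// Sketch of how branch types determine the inferred result type.
// `AnyInferenceSketch`, `mixed` and `consistent` are hypothetical names for illustration.
object AnyInferenceSketch {
  // One branch is an Int, the other a String: with Scala 2 the inferred
  // result type is their common supertype, Any, which Spark cannot map to a schema.
  def mixed(flag: Boolean) = if (flag) 1 else "None"

  // Both branches are Strings, so the inferred result type is String.
  def consistent(flag: Boolean) = if (flag) "1" else "None"
}
```

Declaring the return type explicitly, for example def consistent(flag: Boolean): String, turns an accidental mix of types into a compile error instead of a runtime exception from Spark.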

Ramesh Maharjan