Why do I have to convert the elements of an RDD[Int] to Int or String before I can use them with sortBy (Spark 1.6)?

For example, this gives me an error:

val t = sc.parallelize(1 to 9) //t: org.apache.spark.rdd.RDD[Int]
t.sortBy(_, ascending=false) //error: missing parameter type ...

whereas these work:

t.sortBy(_.toInt, ascending=false).collect //Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)

t.sortBy(_.toString,ascending=false).collect //Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)

And why does calling toString above return Array[Int] instead of Array[String] or Array[Char]?

Just started learning spark, so please go easy on me :-).

Bala
    You're collecting an `RDD[Int]` in both cases, so that's the return type. Regarding the sorting, the `_` is untyped, so it doesn't know how to sort – OneCricketeer Nov 05 '17 at 15:31

1 Answer

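This is a matter of scope, as described in "Underscore in Named Arguments" and hinted at by the error message. A bare underscore used as a whole argument expands around the enclosing call, so it becomes a function that calls sortBy rather than a function passed to sortBy: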

missing parameter type for expanded function 
  ((x$1) => t.sortBy(x$1, ascending = false)) 

You can use identity instead:

t.sortBy(identity, ascending=false)
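As a minimal sketch of why identity works, here is the same idea on a plain Scala Seq instead of an RDD, so it runs without Spark. The names `xs`, `byIdentity`, `byLambda` and `byTyped` are made up for illustration, and the collection version of sortBy has no `ascending` parameter, so only the key function is shown:

```scala
// A plain Scala collection standing in for the RDD (assumption: illustration only).
val xs = Seq(3, 1, 2)

// xs.sortBy(_) has the same shape as the call in the question: the underscore
// expands around the whole call, to (x => xs.sortBy(x)), not to (x => x).

// Three equivalent ways to say "sort by the element itself":
val byIdentity = xs.sortBy(identity)       // Predef.identity, as in the answer
val byLambda   = xs.sortBy(x => x)         // explicit lambda
val byTyped    = xs.sortBy((x: Int) => x)  // lambda with a type annotation
```

With an RDD, the same key functions can be passed as the first argument, for example `t.sortBy(x => x, ascending = false)`.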

And why does calling toString above return Array[Int] instead of Array[String] or Array[Char]?

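Because the function passed to sortBy only computes the key used for comparing elements; it does not transform the data. The result is still an RDD[Int], so collect returns Array[Int]. To get strings back you would have to map the elements first.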
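The difference between a sort key and a transformation can be sketched on a plain Scala Seq (again an illustration without Spark; `nums`, `keyed` and `mapped` are hypothetical names). The input includes a two-digit number so that string ordering and numeric ordering actually differ:

```scala
// Plain Scala collection used for illustration (assumption: no Spark needed here).
val nums = Seq(10, 9, 2)

// sortBy uses toString only as the comparison key; the elements stay Int.
// The keys "10", "2", "9" are compared lexicographically.
val keyed: Seq[Int] = nums.sortBy(_.toString)           // List(10, 2, 9)

// map actually transforms the elements, so the result type changes.
val mapped: Seq[String] = nums.map(_.toString).sorted   // List("10", "2", "9")
```

In the question the values are the single digits 1 to 9, for which string order and numeric order coincide, which is why both sortBy calls printed the same array.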