If I have a simple Scala collection of Ints and I define a simple method isPositive that returns true if a value is greater than 0, then I can pass the method directly to the collection's filter method, as in the example below:
def isPositive(i: Int): Boolean = i > 0
val aList = List(-3, -2, -1, 1, 2, 3)
val newList = aList.filter(isPositive)
> newList: List[Int] = List(1, 2, 3)
So, as far as I understand, the compiler automatically converts the method into a function value by performing eta expansion, and then passes that function as the parameter.
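To make sure I understand eta expansion itself, here is the same thing written out by hand on a plain List (the value names are just my own illustration):

```scala
def isPositive(i: Int): Boolean = i > 0

// Eta expansion written out explicitly: the method becomes a Function1[Int, Boolean] value.
val explicitFn: Int => Boolean = isPositive _

// With an expected function type, the compiler performs the same expansion automatically.
val automaticFn: Int => Boolean = isPositive

val aList = List(-3, -2, -1, 1, 2, 3)
assert(aList.filter(explicitFn) == List(1, 2, 3))
assert(aList.filter(automaticFn) == List(1, 2, 3))
```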
However, if I do the same thing with a Spark Dataset:
val aDataset = aList.toDS
val newDataset = aDataset.filter(isPositive)
> error
It fails with the well-known "missing arguments for method" error. To make it work, I have to convert the method into a function explicitly with a trailing "_":
val newDataset = aDataset.filter(isPositive _)
> newDataset: org.apache.spark.sql.Dataset[Int] = [value: int]
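For the record, the trailing underscore isn't the only spelling that compiles; anything that hands filter an actual function value works. I'm showing the forms on the List so the snippet runs standalone, but the same three forms also compile against the Dataset:

```scala
def isPositive(i: Int): Boolean = i > 0
val aList = List(-3, -2, -1, 1, 2, 3)

val a = aList.filter(isPositive _)         // explicit eta expansion
val b = aList.filter(i => isPositive(i))   // anonymous function wrapping the method
val c = aList.filter(isPositive(_))        // placeholder syntax, equivalent to the line above

assert(a == List(1, 2, 3) && b == a && c == a)
```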
With map, however, it works as expected:
val newDataset = aDataset.map(isPositive)
> newDataset: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
Investigating the signatures, I see that the signature for Dataset's filter is very similar to List's filter:
// Dataset:
def filter(func: T => Boolean): Dataset[T]
// List (Defined in TraversableLike):
def filter(p: A => Boolean): Repr
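One thing I noticed while digging: the one-line comparison above hides the fact that Dataset's filter is overloaded. As far as I can tell from the Spark 2.x API docs, the full set of overloads is:

```scala
// Overloads of Dataset.filter (Spark 2.x, as listed in the API docs):
def filter(condition: Column): Dataset[T]
def filter(conditionExpr: String): Dataset[T]
def filter(func: T => Boolean): Dataset[T]
def filter(func: FilterFunction[T]): Dataset[T]
```

I don't know whether the overloading is what blocks the eta expansion, but it is a difference from List's single filter.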
So, why isn't the compiler doing eta expansion for the Dataset's filter operation?