4

If I have a simple Scala collection of Ints and I define a simple method isPositive to return true if the value is greater than 0, then I can just pass the method to the filter method of the collection, like in the example below

def isPositive(i: Int): Boolean = i > 0

val aList = List(-3, -2, -1, 1, 2, 3)
val newList = aList.filter(isPositive)

> newList: List[Int] = List(1, 2, 3)

So as far as I understand, the compiler is able to automatically convert the method into a function instance by doing eta expansion, and then it passes this function as parameter.

However, if I do the same thing with a Spark Dataset:

val aDataset = aList.toDS
val newDataset = aDataset.filter(isPositive)

> error

It fails with the well-known "missing arguments for method" error. To make it work, I have to explicitly convert the method into a function by using "_":

val newDataset = aDataset.filter(isPositive _)

> newDataset: org.apache.spark.sql.Dataset[Int] = [value: int]

Although with map it works as expected:

val newDataset = aDataset.map(isPositive)

> newDataset: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

Investigating the signatures, I see that the signature for Dataset's filter is very similar to List's filter:

// Dataset:
def filter(func: T => Boolean): Dataset[T]

// List (Defined in TraversableLike):
def filter(p: A => Boolean): Repr

So, why isn't the compiler doing eta expansion for the Dataset's filter operation?

Daniel de Paula
  • 17,362
  • 9
  • 71
  • 72
  • Maybe because `filter` in Scala TraversableLike is not overloaded and in Spark Dataset there are many options – T. Gawęda Aug 09 '17 at 13:31
  • @T.Gawęda The same stands for `map`, but it works as expected for Datasets. That's why I added the example with map. – Daniel de Paula Aug 09 '17 at 13:33
  • You are right! So waiting for real answer :) – T. Gawęda Aug 09 '17 at 13:38
  • 2
    Wait, but map does have one additional implicit parameter, encoder – T. Gawęda Aug 09 '17 at 13:40
  • @T.Gawęda I see... the "Java" version for map has the encoder as a second parameter, so the compiler only finds one option for a single-parameter `map` and therefore it's able to do the conversion. However, for filter, it finds multiple candidates with a single parameter. I'll do some tests to verify if that's the reason. – Daniel de Paula Aug 09 '17 at 13:45

1 Answers1

4

This is due to the nature of overloaded methods and ETA expansion. Eta-expansion between methods and functions with overloaded methods in Scala explains why this fails.

The gist of it is the following (emphasis mine):

when overloaded, applicability is undermined because there is no expected type (6.26.3, infamously). When not overloaded, 6.26.2 applies (eta expansion) because the type of the parameter determines the expected type. When overloaded, the arg is specifically typed with no expected type, hence 6.26.2 doesn't apply; therefore neither overloaded variant of d is deemed to be applicable.

.....

Candidates for overloading resolution are pre-screened by "shape". The shape test encapsulates the intuition that eta-expansion is never used because args are typed without an expected type. This example shows that eta-expansion is not used even when it is "the only way for the expression to type check."

As @DanielDePaula points out, the reason we don't see this effect in DataSet.map is because the overloaded method actually takes an additional Encoder[U] parameter:

def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan {
  MapElements[T, U](func, logicalPlan)
}

def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U] = {
  implicit val uEnc = encoder
  withTypedPlan(MapElements[T, U](func, logicalPlan))
}
Yuval Itzchakov
  • 146,575
  • 32
  • 257
  • 321
  • The map function has similar interface for Java, but it works. Are you sure it depends on the Java interface? – T. Gawęda Aug 09 '17 at 14:02
  • Actually I've just found out that *any* overload makes the method lose the "expected type", so even if there was only `filter(expr: String)` and `filter(f: T => Boolean)` it wouldn't work. ([source](https://stackoverflow.com/questions/17324247/eta-expansion-between-methods-and-functions-with-overloaded-methods-in-scala)) – Daniel de Paula Aug 09 '17 at 14:02
  • @DanieldePaula You're right, https://stackoverflow.com/questions/17324247/eta-expansion-between-methods-and-functions-with-overloaded-methods-in-scala. I'll update the answer. – Yuval Itzchakov Aug 09 '17 at 14:04
  • 1
    for map, the problem doesn't happen because there is not a "real" overload, since the alternative method has two parameters (the function and the encoder) – Daniel de Paula Aug 09 '17 at 14:06
  • Yup, so it not depends on Java interface, but only on overloading :) – T. Gawęda Aug 09 '17 at 14:06
  • Definitely! I was wondering this myself lately when converting an application to Structured Streaming but I forgot to check. – Yuval Itzchakov Aug 09 '17 at 14:13