
I have encountered a peculiar issue while working with Spark, and I am not quite sure what is going on; it would be great if someone could help. I have a function, similar to the one below, that casts DataFrames to Datasets of some type decided at runtime. I need to work with Datasets because the underlying case classes have some annotations that I would like to use.

def ret(spark: SparkSession, dss: DataFrame, typ: String): Dataset[_ <: Product] = {
  import spark.implicits._
  typ match {
    case "t1" => dss.as[T1]
    case "t2" => dss.as[T2]
  }
}

I am able to cast a DataFrame to a Dataset with the following function call: `val ds = ret(spark, dataframe, "t1")`

Everything works well with this function. Now I want to extend it to return a `Dataset[(String, _ <: Product)]`, so I modify the function like this:

def ret(spark: SparkSession, dss: DataFrame, typ: String): Dataset[(String, _ <: Product)] = {
  import spark.implicits._
  typ match {
    case "t1" => dss.as[(String, T1)]
    case "t2" => dss.as[(String, T2)]
  }
}

This gives me a compile error saying that type `(String, T1)` does not match the expected type `(String, _ <: Product)`. What is actually happening here, and how can I fix it? Any hints would be much appreciated!

Thanks a lot!!

Update: the upper bound `<: Product` refers to `scala.Product`, and T1, T2 can be any case classes, for example:

case class T1(name: String, age: Int)

case class T2(name: String, max: Int, min: Int)

But they can really be anything.
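For reference, here is a self-contained sketch of the working single-type version, combining the function above with the example case classes (Spark setup and a schema-compatible DataFrame are assumed):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class T1(name: String, age: Int)
case class T2(name: String, max: Int, min: Int)

// Decides the target case class at runtime; the upper bound _ <: Product
// is satisfied because every case class extends scala.Product.
def ret(spark: SparkSession, dss: DataFrame, typ: String): Dataset[_ <: Product] = {
  import spark.implicits._
  typ match {
    case "t1" => dss.as[T1]
    case "t2" => dss.as[T2]
  }
}

// val ds = ret(spark, dataframe, "t1")  // dataframe's columns must match T1
```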

1 Answer


The common supertype of Dataset[(String, T1)] and Dataset[(String, T2)] is not Dataset[(String, _ <: Product)] but the more complex existential type

Dataset[(String, T)] forSome { type T <: Product }

Dataset[(String, _ <: Product)] is also really an existential type, but a different one; it's shorthand for

Dataset[(String, T) forSome { type T <: Product }]

Note that to use Dataset[(String, T)] forSome { type T <: Product } without warnings, you need to add import scala.language.existentials (and that these types will be removed in Scala 3).
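To make the distinction concrete without Spark, here is a minimal sketch using a hypothetical invariant `Box` container standing in for `Dataset` (which is likewise invariant in its type parameter):

```scala
import scala.language.existentials

// Hypothetical invariant container standing in for Dataset.
case class Box[A](value: A)

case class P1(x: Int) // stand-in for T1/T2; any case class extends Product

// Outer existential: "a Box of pairs whose second component is some
// single, fixed subtype of Product".
type Outer = Box[(String, T)] forSome { type T <: Product }

// Inner existential: what Box[(String, _ <: Product)] expands to --
// the pair's second component may be any subtype of Product.
type Inner = Box[(String, T) forSome { type T <: Product }]

// A Box[(String, P1)] conforms to Outer (pick T = P1)...
val ok: Outer = Box(("a", P1(1))): Outer

// ...but not to Inner, because Box is invariant in A:
// val bad: Inner = Box(("a", P1(1)))  // would not compile
```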

EDIT: I thought that what I checked would be enough, but apparently type inference fails here and I really don't understand why.

def ret(spark: SparkSession, dss: DataFrame, typ: String): Dataset[(String, T)] forSome { type T <: Product } = {
  import spark.implicits._
  typ match {
    case "t1" => dss.as[(String,T1)]: (Dataset[(String, T)] forSome { type T <: Product })
    case "t2" => dss.as[(String,T2)]: (Dataset[(String, T)] forSome { type T <: Product })
  }
}

does compile as expected. You can extract a type alias to avoid duplication:

type DatasetStringT = Dataset[(String, T)] forSome { type T <: Product }

def ret(spark: SparkSession, dss: DataFrame, typ: String): DatasetStringT = {
  import spark.implicits._
  typ match {
    case "t1" => dss.as[(String,T1)]: DatasetStringT 
    case "t2" => dss.as[(String,T2)]: DatasetStringT 
  }
}
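As an aside (a sketch, not part of the original answer): if the existential type proves awkward for downstream code, a hypothetical alternative is to wrap each branch in a small sealed ADT, so every case keeps its precise element type and no existentials are needed. The wrapper names below are illustrative:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical wrapper ADT; T1 and T2 are the question's case classes.
sealed trait TaggedDs
case class T1Ds(ds: Dataset[(String, T1)]) extends TaggedDs
case class T2Ds(ds: Dataset[(String, T2)]) extends TaggedDs

def retWrapped(spark: SparkSession, dss: DataFrame, typ: String): TaggedDs = {
  import spark.implicits._
  typ match {
    case "t1" => T1Ds(dss.as[(String, T1)])
    case "t2" => T2Ds(dss.as[(String, T2)])
  }
}

// Callers pattern match to recover the precise type:
// retWrapped(spark, df, "t1") match {
//   case T1Ds(ds) => ??? // ds: Dataset[(String, T1)]
//   case T2Ds(ds) => ??? // ds: Dataset[(String, T2)]
// }
```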
Alexey Romanov
  • thanks a lot for the answer, but i still get a compile `Error:(70, 114) type mismatch; found : org.apache.spark.sql.Dataset[_1] where type _1 >: (String, T1) with (String, T2) <: (String, Product with Serializable) required: org.apache.spark.sql.Dataset[(String, T)] forSome { type T <: Product } Note: _1 <: (String, Product), but class Dataset is invariant in type T. You may wish to define T as +T instead. (SLS 4.5) def ret(spark: SparkSession, dss: DataFrame, typ: String):Dataset[(String, T)] forSome { type T <: Product } = {` any idea what i might be doing wrong here? – Sai Kiran KrishnaMurthy Nov 10 '19 at 09:08
  • This is because the Dataset API is invariant. Which means that if Something is invariant then even if A is a subclass of B then Something[A] is not a subclass of Something[B]. If you can explain the usecase of what you are trying to achieve there might be other ways of implementing it as you would not be able to modify Dataset API – Jayadeep Jayaraman Nov 10 '19 at 09:22
  • @jjayadeep Yes, but `Dataset[(String, T)] forSome { type T <: Product }` is not a `Dataset[Something]`. – Alexey Romanov Nov 10 '19 at 09:43
  • @AlexeyRomanov - Yes correct. I was just trying to explain invariance to the OP. I was also having trouble understanding why the type system fails here and I had to do something similar to what you have posted as edit. – Jayadeep Jayaraman Nov 10 '19 at 09:48
  • Thanks a lot, Alexey and jjayadeep. Just so that I actually understand what is happening here: `dss.as[(String,T1)]: DatasetStringT` — what does this statement mean? Are you casting dss.as[(String,T1)] to DatasetStringT, or is it a hint for the compiler? – Sai Kiran KrishnaMurthy Nov 10 '19 at 11:08
  • The statement means that the pattern match returns a `type DatasetStringT` which has been defined above. It is not casting but a way to ensure that the code complies with the type system. – Jayadeep Jayaraman Nov 10 '19 at 11:18
  • @SaiKiranKrishnaMurthy as jjayadeep says. It can trigger implicit conversions as well, but doesn't here. The technical term is "type ascription" and you can read more here https://stackoverflow.com/questions/2087250/what-is-the-purpose-of-type-ascriptions-in-scala – Alexey Romanov Nov 10 '19 at 11:31
  • Thanks a lot really appreciate it!! :) – Sai Kiran KrishnaMurthy Nov 10 '19 at 11:39