I'm new to Scala.

Here's what I'm trying to understand.

This code snippet gives me an RDD[Int] and doesn't offer the option to call toDF:

var input = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9))

But when I import spark.sqlContext.implicits._, it gives me the option to use toDF:

import spark.sqlContext.implicits._
var input = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9)).toDF

So I looked into the source code: implicits is present in the SQLContext class as an object. What I can't understand is how an RDD instance is able to call toDF after the import.

Can anyone help me understand?

Update

I found the code snippet below in the SQLContext class:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

  object implicits extends SQLImplicits with Serializable {
    protected override def _sqlContext: SQLContext = self
  }
Javastudent

1 Answer


toDF is an extension method. With the import you bring the necessary implicits into scope.

For example, Int doesn't have a method foo:

1.foo() // doesn't compile

But if you define an extension method and import the implicit:

object implicits {
  implicit class IntOps(i: Int) {
    def foo() = println("foo")
  }
}

import implicits._
1.foo() // compiles

The compiler transforms 1.foo() into new IntOps(1).foo().
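
To verify this, you can write the desugared form out by hand. It compiles even without the import, because the class is named explicitly (a small illustration reusing the implicits object defined above):

// no import needed: we name the wrapper class directly, which is
// exactly what the compiler inserts for us after `import implicits._`
new implicits.IntOps(1).foo() // prints "foo"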

Similarly,

object implicits extends SQLImplicits ...

abstract class SQLImplicits ... {
  ...

  implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
    DatasetHolder(_sqlContext.createDataset(rdd))
  }

  implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
    DatasetHolder(_sqlContext.createDataset(s))
  }
}

case class DatasetHolder[T] private[sql](private val ds: Dataset[T]) {

  def toDS(): Dataset[T] = ds

  def toDF(): DataFrame = ds.toDF()

  def toDF(colNames: String*): DataFrame = ds.toDF(colNames : _*)
}

With import spark.sqlContext.implicits._ in scope, the compiler transforms spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9)).toDF into rddToDatasetHolder(spark.sparkContext.parallelize...).toDF, i.e. DatasetHolder(_sqlContext.createDataset(spark.sparkContext.parallelize...)).toDF.
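
To see the whole mechanism in one self-contained snippet, here is a minimal sketch mirroring the SQLImplicits/DatasetHolder pattern without Spark. All names here (MyRDD, MyDataFrame, Holder, myImplicits) are made-up stand-ins for illustration, not Spark's API:

import scala.language.implicitConversions

// made-up stand-in for RDD: just wraps a Seq
class MyRDD[T](val data: Seq[T])

// made-up stand-in for DataFrame
class MyDataFrame(val rows: Seq[Any])

// mirrors DatasetHolder: a wrapper whose only job is to expose toDF
case class Holder[T](private val rdd: MyRDD[T]) {
  def toDF(): MyDataFrame = new MyDataFrame(rdd.data)
}

// mirrors SQLImplicits: importing its members brings the conversion into scope
object myImplicits {
  implicit def rddToHolder[T](rdd: MyRDD[T]): Holder[T] = Holder(rdd)
}

import myImplicits._
val df = new MyRDD(List(1, 2, 3)).toDF() // rewritten by the compiler to
                                         // rddToHolder(new MyRDD(...)).toDF()

Note that MyRDD itself is never modified: the extra method lives on the wrapper, and the import only makes the conversion to that wrapper visible to the compiler.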

You can read more about implicits and extension methods in Scala:

Understanding implicit in Scala

Where does Scala look for implicits?

Understand Scala Implicit classes

https://docs.scala-lang.org/overviews/core/implicit-classes.html

https://docs.scala-lang.org/scala3/book/ca-extension-methods.html

https://docs.scala-lang.org/scala3/reference/contextual/extension-methods.html

How extend a class is diff from implicit class?


About spark.implicits._

Importing spark.implicits._ in scala

What is imported with spark.implicits._?

import implicit conversions without instance of SparkSession

Workaround for importing spark implicits everywhere

Why is spark.implicits._ is embedded just before converting any rdd to ds and not as regular imports?

Dmytro Mitin
  • Before the import it's an RDD instance, so after import implicits._, do all the methods of implicits, including toDF, get added to the RDD instance? – Javastudent Apr 17 '23 at 11:16
  • @Javastudent See the update. Is it clearer? – Dmytro Mitin Apr 17 '23 at 11:20
  • If SQLImplicits has two implicits and both take an RDD as a parameter, how will it choose? – Javastudent Apr 17 '23 at 11:30
  • @Javastudent You should read about implicits in Scala: https://stackoverflow.com/questions/5598085/where-does-scala-look-for-implicits You can make implicits higher- or lower-priority. Among eligible implicits the compiler prefers the more specific one; if it can't choose, you'll get a compile error (`ambiguous implicits`). For example https://scastie.scala-lang.org/DmytroMitin/SrRDK6FoSCm7mUBxJcQWWw but https://scastie.scala-lang.org/DmytroMitin/SrRDK6FoSCm7mUBxJcQWWw/1 (a small sketch of this prioritization follows below this thread) – Dmytro Mitin Apr 17 '23 at 11:37
  • One last thing: the import means we are looking for implicits, which makes sense. But once an existing class gains more methods through implicits, how does the compiler link those new methods to the existing class? Does it create a new class with the same name and more methods, or is this just something the Scala compiler does as part of compilation? – Javastudent Apr 17 '23 at 11:48
  • @Javastudent Rewriting `spark.sparkContext.parallelize(...).toDF` into `rddToDatasetHolder(spark.sparkContext.parallelize(...)).toDF` is done by the compiler – Dmytro Mitin Apr 17 '23 at 13:06
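
To illustrate the prioritization discussed in the comments above, here is a minimal sketch (all names invented for illustration) of the "low-priority implicits" pattern: an implicit defined directly in an object counts as more specific than one it inherits from a parent trait, so the compiler picks it instead of reporting an ambiguity. Spark's own SQLImplicits extends a LowPrioritySQLImplicits trait for the same reason.

trait LowPriorityOps {
  implicit class FooOpsLow(i: Int) {
    def foo(): String = "low priority"
  }
}

object ops extends LowPriorityOps {
  // defined in the object itself, hence preferred over the inherited FooOpsLow
  implicit class FooOpsHigh(i: Int) {
    def foo(): String = "high priority"
  }
}

import ops._
1.foo() // resolves to FooOpsHigh and returns "high priority";
        // two candidates at the same priority level would instead be
        // rejected as ambiguous at compile time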