
I am learning Spark Datasets and checking how we can convert an RDD to a Dataset.

For this, I have the following code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .appName("SparkSQL")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.sparkContext.textFile("../myfile.csv")
    val structuredData = lines.map(mapperToConvertToStructureData) // user-defined mapper, not shown here

    import spark.implicits._
    val someDataset = structuredData.toDS

Here, if we want to convert an RDD to a Dataset, we need import spark.implicits._ just before the conversion.

Why is this written just before the conversion? Can we place this import at the top of the file, as we do with regular imports?

KayV

2 Answers


Here spark is an instance of the class org.apache.spark.sql.SparkSession, so the instance must exist before you can import from it.
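
This is ordinary Scala scoping rather than anything Spark-specific: an import from a value is only legal once that value is a stable identifier in scope. A minimal sketch (Container, Demo, and greeting are made-up names for illustration):

    class Container {
      implicit val greeting: String = "hello"
    }

    object Demo extends App {
      val c = new Container       // the instance must exist first...
      import c._                  // ...before its members can be imported
      println(implicitly[String]) // resolved via the imported implicit greeting
    }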

ollik1

Spark implicits are required to work with Datasets because that is where all the implicit functions and classes needed for the Encoders are found. Encoders are required for every transformation to a Dataset: take a look at the documentation and you will see that each Dataset transformation has an A : Encoder bound or an implicit Encoder parameter.
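
As a concrete illustration of that bound, a helper that maps over a Dataset must get an Encoder from somewhere, either from import spark.implicits._ at the call site or threaded through explicitly. A small sketch (doubleAll is a made-up helper):

    import org.apache.spark.sql.{Dataset, Encoder}

    // Dataset.map[U : Encoder] desugars to an implicit Encoder[U] parameter,
    // so we pass one along explicitly here instead of importing implicits:
    def doubleAll(ds: Dataset[Int])(implicit enc: Encoder[Int]): Dataset[Int] =
      ds.map(_ * 2)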

In Scala these implicits would normally live in an object, but in Spark they are inside the SparkSession class, so until you have an instance, you can't import them.
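
So the import does not have to sit immediately before the conversion; it only has to appear after the SparkSession instance has been created. A minimal sketch of hoisting it to the top of an object body (Example is a made-up name):

    import org.apache.spark.sql.SparkSession

    object Example {
      val spark: SparkSession = SparkSession.builder
        .appName("SparkSQL")
        .master("local[*]")
        .getOrCreate()

      // Legal here, well above any conversion: `spark` is a stable
      // identifier that already exists at this point.
      import spark.implicits._

      def main(args: Array[String]): Unit = {
        val ds = Seq(1, 2, 3).toDS()
        ds.show()
      }
    }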

Alfilercio