I want to write my output file in Parquet format. To do that, I convert the RDD to a Dataset, since an RDD cannot be written out as Parquet directly. Creating the Dataset requires an implicit encoder; otherwise the code does not compile. I have a few questions about this. Here is my code:
import org.apache.spark.sql.{Dataset, SaveMode}

// Kryo-based encoder so that ItemData instances can be stored in a Dataset
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[ItemData]

val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)
ds.write
  .mode(SaveMode.Overwrite)
  .parquet(configuration.outputPath)
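For context, this is roughly the variant that does not compile for me when no encoder is in scope (a sketch only; the dsWithoutEncoder name is just for illustration, and the error wording in the comment is from memory, so it may differ slightly):

// Sketch: the same call, but with the implicit Encoders.kryo[ItemData] line removed.
// This fails at compile time with an error along the lines of:
//   "Unable to find encoder for type ItemData. An implicit Encoder[ItemData]
//    is needed to store ItemData instances in a Dataset."
val dsWithoutEncoder: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)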
My questions are:
- Why is it important to use an encoder when creating the Dataset, and what does this encoder actually do?
- When I look at the Parquet output produced by the code above, the data appears in an encoded (binary) form. How can I decode it? When I base64-decode it, I get something like the following: com.........processor.spark.ItemDat"0156028263
  In other words, it looks like an object.toString()-style value.
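To show what I mean, here is how I am inspecting the output (a sketch; I read back from the same output path, and the schema in the comment is what I expect the Kryo encoder to produce, not something I have confirmed):

// Read the Parquet output back to inspect it (sketch).
val readBack = sparkSession.read.parquet(configuration.outputPath)
readBack.printSchema()
// I expect the Kryo encoder to give a single binary column, roughly:
//   root
//    |-- value: binary (nullable = true)
readBack.show(truncate = false)  // this is where I see the encoded values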