I want to write my output file in Parquet format. To do that, I convert the RDD to a Dataset, since an RDD cannot be written out as Parquet directly. Creating the Dataset requires an implicit encoder; otherwise the code does not compile. I have a few questions about this. Here is my code:
import org.apache.spark.sql.{Dataset, SaveMode}

// Kryo-based encoder so that ItemData instances can be stored in a Dataset
implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[ItemData]

val ds: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)
ds.write
  .mode(SaveMode.Overwrite)
  .parquet(configuration.outputPath)
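For context, this is roughly the variant that does not compile for me when no encoder is in scope (a sketch only; the dsWithoutEncoder name is just for illustration, and the error wording in the comment is from memory, so it may differ slightly):

// Sketch: the same call, but with the implicit Encoders.kryo[ItemData] line removed.
// This fails at compile time with an error along the lines of:
//   "Unable to find encoder for type ItemData. An implicit Encoder[ItemData]
//    is needed to store ItemData instances in a Dataset."
val dsWithoutEncoder: Dataset[ItemData] = sparkSession.createDataset(filteredRDD)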
My questions are:
- Why is it important to use an encoder when creating the Dataset, and what does this encoder actually do?
- When I look at the Parquet output produced by the code above, the data appears in an encoded (binary) form. How can I decode it? When I base64-decode it, I get something like the following: com.........processor.spark.ItemDat"0156028263
  In other words, it looks like an object.toString()-style value.
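To show what I mean, here is how I am inspecting the output (a sketch; I read back from the same output path, and the schema in the comment is what I expect the Kryo encoder to produce, not something I have confirmed):

// Read the Parquet output back to inspect it (sketch).
val readBack = sparkSession.read.parquet(configuration.outputPath)
readBack.printSchema()
// I expect the Kryo encoder to give a single binary column, roughly:
//   root
//    |-- value: binary (nullable = true)
readBack.show(truncate = false)  // this is where I see the encoded values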