5

If I have a dataset each record of which is a case class, and I persist that dataset as shown below so that serialization is used:

myDS.persist(StorageLevel.MEMORY_ONLY_SER)

Does Spark use Java/Kryo serialization to serialize the dataset, or does Spark have its own way of storing the data in the dataset, just as it does for a DataFrame?
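For reference, a minimal, self-contained version of the setup (the Person case class and the sample data are purely illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)  // illustrative case class

val spark = SparkSession.builder().master("local[*]").appName("persist-ser").getOrCreate()
import spark.implicits._

val myDS = Seq(Person("Ada", 36), Person("Alan", 41)).toDS()
myDS.persist(StorageLevel.MEMORY_ONLY_SER)  // requests serialized in-memory storage
myDS.count()                                // forces materialization so the data is actually cached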

3 Answers

12

Spark Dataset does not use standard serializers. Instead it uses Encoders, which "understand" the internal structure of the data and can efficiently transform objects (anything that has an Encoder, including Row) into Spark's internal binary storage.
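To make that concrete, here is a small sketch (the Person case class is illustrative) showing that the encoder for a case class is derived from the class's structure rather than from a generic serializer:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

case class Person(name: String, age: Int)  // illustrative

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()
import spark.implicits._  // brings implicit Encoders for case classes, primitives, tuples, ...

// The Product encoder is built from the case class's fields, so the data is laid out
// as typed columns in Spark's internal binary format rather than as an opaque blob.
val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema)       // StructType with a string "name" field and an int "age" field

val ds = Seq(Person("Ada", 36)).toDS()  // picks up the implicit encoder
ds.printSchema()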

The only case where Kryo or Java serialization is used is when you explicitly apply Encoders.kryo[_] or Encoders.javaSerialization[_]. In any other case Spark will destructure the object representation and try to apply standard encoders (atomic encoders, the Product encoder, etc.). The only difference compared to Row is its Encoder, RowEncoder (in a sense, Encoders are similar to lenses).
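A sketch of that explicit opt-in (LegacyEvent is a made-up class with no case-class structure for Spark to destructure):

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Made-up class with no Product structure Spark could map to columns.
class LegacyEvent(val id: Long, val payload: String) extends Serializable

val spark = SparkSession.builder().master("local[*]").appName("kryo-encoder").getOrCreate()

// Only an explicit kryo (or javaSerialization) encoder makes Spark fall back to a
// single opaque binary column instead of typed, destructured fields.
implicit val legacyEncoder: Encoder[LegacyEvent] = Encoders.kryo[LegacyEvent]

val events: Dataset[LegacyEvent] = spark.createDataset(Seq(new LegacyEvent(1L, "raw")))
events.printSchema()  // root |-- value: binary (nullable = true)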

Databricks explicitly puts Encoder / Dataset serialization in contrast to Java and Kryo serializers in its Introducing Apache Spark Datasets post (see in particular the "Lightning-fast Serialization with Encoders" section):

(Benchmark charts from that Databricks post, comparing encoder-based serialization with Kryo and Java serialization, were embedded here.)

– Alper t. Turker (edited by thebluephantom)
  • this is exactly correct, and it's almost always recommended to use the Kryo serde lib, since it's much faster than Java serde and the performance improvement is significant. It also integrates seamlessly with the Java serde framework, so java.io.Serializable is the only thing Kryo needs in order to serialize your custom data structures. For a concrete example see: https://stackoverflow.com/questions/53329178/spark-no-encoder-found-for-java-io-serializable-in-mapstring-java-io-serializa/53356578#53356578 – linehrr Nov 18 '18 at 07:21
3

Dataset[SomeCaseClass] is no different from Dataset[Row] or any other Dataset. It uses the same internal representation (mapped to instances of the external class when needed) and the same serialization method.

Therefore, there is no need for direct object serialization (Java, Kryo).
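A small sketch of that point (the Person case class and the sample rows are illustrative): converting between the typed and untyped views just swaps the Encoder, and MEMORY_ONLY_SER persists the same encoder-produced binary data in both cases:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)  // illustrative

val spark = SparkSession.builder().master("local[*]").appName("same-representation").getOrCreate()
import spark.implicits._

val ds: Dataset[Person]   = Seq(Person("Ada", 36), Person("Alan", 41)).toDS()
val df: DataFrame         = ds.toDF()       // untyped view over the same internal rows
val back: Dataset[Person] = df.as[Person]   // typed view again; only the Encoder changes

println(ds.schema == df.schema)             // true: same structure either way
ds.persist(StorageLevel.MEMORY_ONLY_SER)    // serialized caching of the encoder output
df.persist(StorageLevel.MEMORY_ONLY_SER)    // same storage path for the DataFrame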

  • In recent versions of Spark we mainly have Dataset and DataFrame, with DataFrame being just a special case of Dataset. So, if serialization has no effect on Datasets, why do Spark developers push for Kryo? I'm not sure what you say above is accurate. I think that if DataFrame records are objects, then those objects are serialized; so the Dataset itself might not use serialization, but the objects are. –  Dec 26 '17 at 22:12
-2

Under the hood, a dataset is an RDD. From the documentation for RDD persistence:

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

By default, Java serialization is used (source):

By default, Spark serializes objects using Java’s ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.

To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:

val conf = new SparkConf()
             .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

You may need to register classes with Kryo before creating the SparkContext:

conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
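Putting the two snippets together (Class1 and Class2 are placeholders for your own types), registration happens on the SparkConf before the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

case class Class1(id: Long)      // placeholder
case class Class2(name: String)  // placeholder

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .registerKryoClasses(Array(classOf[Class1], classOf[Class2]))  // register before the context exists

val sc = new SparkContext(conf)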
– Levi Ramsey