Under the hood, a Dataset is an RDD. From the Spark documentation on RDD persistence:
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
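For example, persisting with a serialized storage level looks roughly like this (a minimal sketch; it assumes an existing SparkContext named sc and just builds a throwaway RDD to persist):

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext named `sc`.
val rdd = sc.parallelize(1 to 1000000)

// Keep each partition as one serialized byte array instead of deserialized objects.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)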
By default, Java serialization is used (source):
By default, Spark serializes objects using Java’s ObjectOutputStream framework... Spark can also use the Kryo library (version 2) to serialize objects more quickly.
To enable Kryo, initialize the job with a SparkConf and set spark.serializer to org.apache.spark.serializer.KryoSerializer:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
You may need to register classes with Kryo before creating the SparkContext:
conf.registerKryoClasses(Array(classOf[Class1], classOf[Class2]))
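Putting it together, here is a hedged sketch of a full setup. Point is a hypothetical class standing in for your own types, and spark.kryo.registrationRequired is optional; it makes Spark fail fast if something is serialized without being registered:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical user class; substitute your own types.
case class Point(x: Double, y: Double)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: error out if a class is serialized without being registered.
  .set("spark.kryo.registrationRequired", "true")

conf.registerKryoClasses(Array(classOf[Point]))

val sc = new SparkContext(conf)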