I am using Spark 2.4 and referring to https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
Bean class:
public class EmployeeBean implements Serializable {
    private Long id;
    private String name;
    private Long salary;
    private Integer age;
    // getters and setters
}
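For completeness, here is one plausible sketch of the populateEmployees helper used below (the original code is not shown in the post; the bean is inlined and the field values are made-up demo data):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the populateEmployees helper; not the original implementation.
public class PopulateDemo {

    public static class EmployeeBean implements Serializable {
        private Long id;
        private String name;
        private Long salary;
        private Integer age;

        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Long getSalary() { return salary; }
        public void setSalary(Long salary) { this.salary = salary; }
        public Integer getAge() { return age; }
        public void setAge(Integer age) { this.age = age; }
    }

    // Builds beans with ids in [from, to]; salary and age are arbitrary demo values.
    public static List<EmployeeBean> populateEmployees(int from, int to) {
        List<EmployeeBean> employees = new ArrayList<>(to - from + 1);
        for (int i = from; i <= to; i++) {
            EmployeeBean e = new EmployeeBean();
            e.setId((long) i);
            e.setName("employee-" + i);
            e.setSalary(30_000L + (i % 10_000));
            e.setAge(20 + (i % 40));
            employees.add(e);
        }
        return employees;
    }
}
```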
Spark Example:
SparkSession spark = SparkSession.builder().master("local[4]").appName("play-with-spark").getOrCreate();
List<EmployeeBean> employees1 = populateEmployees(1, 1_000_000);
Dataset<EmployeeBean> ds1 = spark.createDataset(employees1, Encoders.kryo(EmployeeBean.class));
Dataset<EmployeeBean> ds2 = spark.createDataset(employees1, Encoders.bean(EmployeeBean.class));
ds1.persist(StorageLevel.MEMORY_ONLY());
long ds1Count = ds1.count();
ds2.persist(StorageLevel.MEMORY_ONLY());
long ds2Count = ds2.count();
I looked at the Storage tab in the Spark Web UI. The relevant part:
ID  RDD Name                                          Size in Memory
2   LocalTableScan [value#0]                          56.5 MB
13  LocalTableScan [age#6, id#7L, name#8, salary#9L]  23.3 MB
A few questions:
Shouldn't the Kryo-serialized RDD be smaller than the bean-encoded one, instead of more than double its size?
I also tried MEMORY_ONLY_SER() mode, and the RDD sizes were the same. With serialized storage, an RDD should be stored as one byte array per partition, so shouldn't the persisted size be smaller than the deserialized one?
What exactly do the Kryo and bean encoders do when creating a Dataset?
Can I rename persisted RDDs for better readability?