I have the following scenario. We need to provide a function that takes an RDD of any type (in generics notation, RDD[T]), serializes it, and saves it to HDFS as an Avro data file.
Note that the RDD can hold elements of any type, so the function must be generic in the element type, for example RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].
The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class type, so that it can be saved as an Avro data file?
The functionality is already built, but it uses Java serialization, which carries a space and time penalty, so we would like to refactor it. We can't use DataFrames.