I have a Kafka topic where Protobuf messages arrive. I want to store them in blob storage and then be able to query them. Due to the volume of data, I want to do this with Spark in order to have a scalable solution. The problem is that Spark does not have native support for Protobuf. I can think of two possible solutions:
- Store the raw data as a binary column in some format such as Parquet. When querying, use a UDF to parse the raw data (see the sketch after this list).
- Convert the Protobuf to Parquet (this should map 1:1) on write. Then reading/querying with Spark becomes trivial.
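For (1), this is roughly what I have in mind (an untested sketch; the column name payload, the accessor getSomeField() and the paths are placeholders, and MyProtobufClass stands for my generated Protobuf class):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class ProtoUdfQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("proto-udf-query").getOrCreate();

        // UDF that deserializes the raw Protobuf bytes and pulls out one field.
        // MyProtobufClass is the generated class; getSomeField() is just a placeholder accessor.
        spark.udf().register("extractSomeField",
                (UDF1<byte[], String>) bytes -> MyProtobufClass.parseFrom(bytes).getSomeField(),
                DataTypes.StringType);

        // The raw messages sit in a binary column ("payload") of a Parquet table.
        Dataset<Row> raw = spark.read().parquet("/path/to/raw-messages");
        raw.withColumn("someField", callUDF("extractSomeField", col("payload"))).show();
    }
}
```

The downside of (1) is that every query pays the full deserialization cost inside the UDF, which is part of why (2) looks more attractive to me.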
I have managed to implement (2) in plain Java using org.apache.parquet.proto.ProtoParquetWriter, i.e.:
```java
// file is an org.apache.hadoop.fs.Path; messages is an Iterable<MyProtobufClass>
ProtoParquetWriter<MyProtobufClass> w = new ProtoParquetWriter<>(file, MyProtobufClass.class);
for (MyProtobufClass record : messages) {
    w.write(record);
}
w.close();
```
The problem is: how do I implement this using Spark?
There is the sparksql-protobuf project, but it hasn't been updated in years.
I have also found this related question, but it is 3.5 years old and has no answer.
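To make the question more concrete, this is roughly what I imagine having to write myself (a hypothetical, untested sketch; the broker, topic and output path are placeholders):

```java
import java.util.UUID;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.proto.ProtoParquetWriter;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProtoToParquetJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("proto-to-parquet").getOrCreate();

        // Read the raw Kafka values as bytes in a batch job (broker and topic are placeholders).
        Dataset<Row> kafka = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "my-topic")
                .load();

        // Write each Spark partition with ProtoParquetWriter, like the plain Java code above.
        kafka.select("value").foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            // One Parquet file per partition; the output path is just an example.
            Path out = new Path("/output/part-" + UUID.randomUUID() + ".parquet");
            try (ProtoParquetWriter<MyProtobufClass> writer =
                         new ProtoParquetWriter<>(out, MyProtobufClass.class)) {
                while (rows.hasNext()) {
                    byte[] bytes = rows.next().getAs("value");
                    writer.write(MyProtobufClass.parseFrom(bytes));
                }
            }
        });
    }
}
```

Is hand-rolling the writer inside foreachPartition like this reasonable, or is there a more idiomatic way to get from Protobuf to Parquet with Spark?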