I have a sequence file whose values are proto3-encoded byte arrays.
I looked into elephant-bird, which is very old and only supports protobuf 2.x: https://github.com/kevinweil/elephant-bird
It has also stopped publishing new releases, and the latest one is already a couple of years old, so I doubt it still works with current versions.
I assume I am not the only one who has run into this issue, so here is the scenario.
I wrote an application that generates a sequence file of (key, value) records. The key is irrelevant; the value is a proto3-encoded byte array. When my app generates the file, it does not know (or need to know) the proto schema; it simply takes the byte array and puts it into the sequence file.
When I create a table in Hive to query the data, I want to give Hive enough information about the proto schema so that it can define the table's columns correctly.
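To illustrate why that information is needed: the proto3 wire format stores only field numbers and wire types, not field names or declared types, so the value bytes alone do not describe the columns. Below is a minimal sketch of what can be recovered without the .proto schema. It is not part of my application; the class and method names (WireFormatPeek, describeFields) are made up for illustration, and it only walks top-level fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: walks the top-level fields of a
// protobuf-encoded byte array without any schema.
class WireFormatPeek {

    // Reads a base-128 varint starting at pos[0] and advances pos[0] past it.
    static long readVarint(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Returns one description per top-level field found in the bytes.
    static List<String> describeFields(byte[] data) {
        List<String> out = new ArrayList<>();
        int[] pos = {0};
        while (pos[0] < data.length) {
            long tag = readVarint(data, pos);
            long fieldNumber = tag >>> 3;      // upper bits: field number
            int wireType = (int) (tag & 0x7);  // low 3 bits: wire type
            switch (wireType) {
                case 0: // varint: could be int32, int64, bool, enum, ...
                    out.add("field " + fieldNumber + ": varint = " + readVarint(data, pos));
                    break;
                case 1: // fixed 64-bit: fixed64, sfixed64 or double
                    pos[0] += 8;
                    out.add("field " + fieldNumber + ": 64-bit");
                    break;
                case 2: // length-delimited: string, bytes, nested message or packed repeated
                    int len = (int) readVarint(data, pos);
                    pos[0] += len;
                    out.add("field " + fieldNumber + ": " + len + " bytes");
                    break;
                case 5: // fixed 32-bit: fixed32, sfixed32 or float
                    pos[0] += 4;
                    out.add("field " + fieldNumber + ": 32-bit");
                    break;
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Field 1 = 150 (varint), field 2 = "testing" (length-delimited).
        byte[] value = {0x08, (byte) 0x96, 0x01, 0x12, 0x07, 't', 'e', 's', 't', 'i', 'n', 'g'};
        describeFields(value).forEach(System.out::println);
    }
}
```

For the sample bytes this prints "field 1: varint = 150" and "field 2: 7 bytes". Nothing in the output says whether field 2 is a string, raw bytes, or a nested message, which is why the SerDe has to be given the generated class (or some other form of the schema).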
Elephant-bird gives the following example: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive#reading-protocol-buffers
create table users
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.proto.gen.Storage$User")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat";
But since that is very old, is there an equivalent solution for Hive 2.4.6 and proto3 that someone can point me to?
Thank you.