I have a sequence file whose values are proto3-encoded byte arrays.

I looked into elephant-bird, which is very old and only supports protobuf 2.x: https://github.com/kevinweil/elephant-bird

It has also stopped publishing new releases, and the latest one is already a couple of years old, so I doubt it still works with current versions.

I assume I am not the only one who has run into this issue, so here is the scenario.

I wrote an application that generates a sequence file of (key, value) records. The key is irrelevant; the value is a proto3-encoded byte array. When my app generates the file, it does not know (or need to know) the proto schema: it simply takes the byte array and puts it into the sequence file.

When I create a table in Hive to query the data, I want to give Hive enough information about the proto schema so that it can create the table and deserialize the values correctly.
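
To illustrate why Hive needs that information: the protobuf wire format carries only field numbers and wire types, not field names or declared types. Below is a minimal Python sketch of that point. It is not part of my application; `read_varint`, `scan_fields`, and the sample bytes are hypothetical names and data made up for illustration, and the deprecated group wire types are not handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear marks the last byte
            return result, pos
        shift += 7


def scan_fields(buf):
    """Return (field_number, wire_type, raw_value) for each top-level field."""
    pos = 0
    fields = []
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint (int32, int64, bool, enum, ...)
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical record: field 1 is the varint 150, field 2 is the bytes "bob".
record = bytes([0x08, 0x96, 0x01, 0x12, 0x03]) + b"bob"
print(scan_fields(record))  # [(1, 0, 150), (2, 2, b'bob')]
```

Without the schema there is no way to tell that field 2 is, say, a user name rather than a nested message, which is why the reader side has to be given the generated class or descriptor.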

Elephant-bird gives the following example: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive#reading-protocol-buffers

create table users
 row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
 "serialization.class"="com.example.proto.gen.Storage$User")
stored as
 inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat";

Since that approach is very old, is there an equivalent solution for Hive 2.4.6 and proto3 that someone can point me to?

Thank you.