I have a data pipeline that writes protobufs into HDFS, and now I need a way to query that data. I stumbled upon Elephant-Bird and Hive and have been trying to get this solution up and running for a day now.
Here are the steps that I took:
1.) Installed Hadoop 2.7.3, Hive 2.1.1 and Protobuf 3.0.0
2.) Cloned Elephant-Bird 4.16; the build was successful
3.) Started Hive and added the Elephant-Bird core, hive and hadoop-compat jars
4.) Generated the Java class for the .proto file, packaged it with protobuf-java-3.0.0.jar and added it to Hive
5.) Added protobuf-java-3.0.0.jar to Hive
After all of this I execute a create external table statement as follows:
create external table tracks
row format serde
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="protobuf.TracksProtos$Env")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/tracks/';
And I receive this message in the logs:
2017-10-26T17:36:30,838 ERROR [main] util.Protobufs: Error invoking method getDescriptor in class class protobuf.TracksProtos$Env
java.lang.reflect.InvocationTargetException
.....
.....
.....
Caused by: java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
I know the method is not actually missing: I can list the jars from Hive and see that they were all added, and when I expand them I can see the classes and methods that are reported as not existing.
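For reference, the "expand the jars and look" check can be sketched in Java as below. This is only an illustration: the class name JarClassFinder and the helper jarsContaining are made up for this post, and the directory to scan is passed as the first argument. It lists every jar directly under a directory that contains a given class, which makes it easy to see whether more than one jar ships com.google.protobuf.Descriptors$Descriptor.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Hive or Elephant-Bird).
public class JarClassFinder {

    /** Returns the jars directly under dir that contain the given class. */
    static List<Path> jarsContaining(Path dir, String className) throws IOException {
        // A class a.b.C$D is stored inside a jar as the entry a/b/C$D.class
        String entry = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jf = new JarFile(jar.toFile())) {
                    if (jf.getJarEntry(entry) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory to scan, e.g. the Hive lib directory
        // args[1]: optional class name; defaults to the class from the stack trace
        Path dir = Paths.get(args[0]);
        String cls = args.length > 1
                ? args[1]
                : "com.google.protobuf.Descriptors$Descriptor";
        for (Path jar : jarsContaining(dir, cls)) {
            System.out.println(jar);
        }
    }
}
```

Running it once against the Hive lib directory and once against the directory holding the jars I added would show every jar that provides the class. It only proves the class is present in a jar, not which copy the JVM actually loads.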
If I look under $HIVE_HOME/lib, I see that Hive ships with protobuf-java-2.5.0.jar. I am wondering whether this is the cause of the error and, if so, what my options are to correct it.
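To test that suspicion, a small check like the following could report which jar a class is actually loaded from and whether it has the method named in the error. This is a sketch under assumptions: ClassOrigin, locationOf and hasPublicMethod are names invented for this post, and the result is only meaningful if the program runs with the same classpath, in the same order, as the Hive session.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical diagnostic (not part of Hive or Elephant-Bird).
public class ClassOrigin {

    /** Jar or directory the class was loaded from, or a placeholder if the JVM reports none. */
    static String locationOf(Class<?> c) {
        ProtectionDomain pd = c.getProtectionDomain();
        CodeSource src = (pd == null) ? null : pd.getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap/unknown)"; // typical for JDK classes
        }
        return src.getLocation().toString();
    }

    /** True if the loaded class exposes a public method with this name. */
    static boolean hasPublicMethod(Class<?> c, String name) {
        for (Method m : c.getMethods()) {
            if (m.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults are taken from the stack trace in this post.
        String cls = args.length > 0 ? args[0] : "com.google.protobuf.Descriptors$Descriptor";
        String method = args.length > 1 ? args[1] : "getOneofs";
        Class<?> c = Class.forName(cls);
        System.out.println(cls + " loaded from " + locationOf(c));
        System.out.println(method + " present: " + hasPublicMethod(c, method));
    }
}
```

If the reported location were protobuf-java-2.5.0.jar and the method were reported as absent, that would suggest the older jar takes precedence over the 3.0.0 jar I added.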
Thoughts?