For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not totally clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar     -- this contains the generated protobuf classes
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter

// Opens (and overwrites) the target file on HDFS and wraps it in a ProtobufBlockWriter.
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](
    os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with the ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader doesn't work. The problem I have now is that
DUMP raw_data;
returns
Unable to open iterator for alias raw_data
and
ILLUSTRATE raw_data;
returns
No (valid) input data found!
To me, it looks like the data written by ProtobufBlockWriter cannot be read by the ProtobufPigLoader. But what should I use instead? How can I write data from an external tool to HDFS so that it can be processed by ProtobufPigLoader?
Alternative question: What should I use instead? How can I write fairly large objects to Hadoop so that they can be consumed with Pig? The objects are not very complex, but they contain a large list of sub-objects (a repeated field in Protobuf); see the sketch after the list below.
- I want to avoid any text format or JSON because they are simply too large for my data. I expect they would bloat the data by a factor of 2 or 3 (lots of integers, lots of byte strings that I would need to encode as Base64).
- I want to avoid normalizing the data so that the id of the main object is attached to each of the sub-objects (this is what is done now), because this also blows up the space consumption and makes joins necessary in the later processing.
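For illustration, this is roughly how such an object is assembled before it is handed to the writer above. All message and field names here (Chunk, setFilename, addChunk, setHash, setSize) are placeholders, since the real names come from my .proto definition; only the builder pattern itself is the standard protobuf Java API.

// Placeholder names: the real message/field names come from the .proto file.
val fileBuilder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
fileBuilder.setFilename("/some/path")            // a few scalar fields on the main object
for (i <- 0 until 100000) {                      // one entry per sub-object (repeated field)
  fileBuilder.addChunk(
    de.pc2.dedup.fschunk.pig.PigProtocol.Chunk.newBuilder()
      .setHash(com.google.protobuf.ByteString.copyFrom(Array[Byte](1, 2, 3)))
      .setSize(4096)
      .build())
}
writer.write(fileBuilder.build())                // one large object; no join key on the sub-objects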
Updates:
- I don't use code-generated protobuf loader classes, but the reflection-based loader
- The protobuf classes are in a jar that is registered.
- DESCRIBE correctly shows the types.