For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not totally clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar     -- this contains the generated protobuf classes
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter

// Opens (and overwrites) the target file on HDFS and wraps it in a ProtobufBlockWriter.
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](
    os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with the ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader doesn't work. The problem I have now is that
DUMP raw_data;
returns
Unable to open iterator for alias raw_data
and
ILLUSTRATE raw_data;
returns
No (valid) input data found!
To me, it looks like the data written by ProtobufBlockWriter cannot be read by the ProtobufPigLoader. But what should I use instead? How can I write data from an external tool to HDFS so that it can be processed by ProtobufPigLoader?
Alternative question: What should I use instead? How can I write fairly large objects to Hadoop so that they can be consumed with Pig? The objects are not very complex, but they contain a large list of sub-objects (a repeated field in Protobuf); see the sketch after the list below.
- I want to avoid any text format or JSON because they are simply too large for my data. I expect they would bloat the data by a factor of 2 or 3 (lots of integers, lots of byte strings that I would need to encode as Base64).
- I want to avoid normalizing the data so that the id of the main object is attached to each of the sub-objects (this is what is done now), because this also blows up the space consumption and makes joins necessary in the later processing.
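For illustration, this is roughly how such an object is assembled before it is handed to the writer above. All message and field names here (Chunk, setFilename, addChunk, setHash, setSize) are placeholders, since the real names come from my .proto definition; only the builder pattern itself is the standard protobuf Java API.

// Placeholder names: the real message/field names come from the .proto file.
val fileBuilder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
fileBuilder.setFilename("/some/path")            // a few scalar fields on the main object
for (i <- 0 until 100000) {                      // one entry per sub-object (repeated field)
  fileBuilder.addChunk(
    de.pc2.dedup.fschunk.pig.PigProtocol.Chunk.newBuilder()
      .setHash(com.google.protobuf.ByteString.copyFrom(Array[Byte](1, 2, 3)))
      .setSize(4096)
      .build())
}
writer.write(fileBuilder.build())                // one large object; no join key on the sub-objects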
Updates:
- I don't use code-generated protobuf loader classes, but the reflection-based loader
- The protobuf classes are in a jar that is registered.
- DESCRIBE correctly shows the types.