
Scenario:

  • There are 1000s of massive binary files on HDFS

  • There is a def decode(localFilePath: String): Array[MyCustomType] which can decode a file, given its local path, into its array of records

How can I use Spark with Scala to load these files in parallel and get an RDD[MyCustomType] in return?

P.S. decode is a Thrift decoder which takes a local file name and loads the Thrift file into memory as an array of records.

I think the missing piece here is downloading the file from HDFS to a node and passing the local name to decode.
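One way to do exactly that, as a minimal sketch (the path list, temp-dir handling, and the assumption that decode is serializable and available on the executors are all illustrative, not from the post): parallelize the list of HDFS paths, and in each task copy the file to the executor's local disk with Hadoop's FileSystem API before calling decode.

import java.io.File
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Sketch: ship the HDFS paths to the executors, copy each file to local
// disk, and pass the local path to the existing decode function.
val paths: Seq[String] = Seq("hdfs:///data/file-00001.bin") // your 1000s of files
val recordsRDD: RDD[MyCustomType] = sparkContext
  .parallelize(paths, numSlices = paths.size) // one file per partition
  .flatMap { hdfsPath =>
    val src = new Path(hdfsPath)
    val fs = src.getFileSystem(new Configuration()) // picks up the cluster config on the executor
    val localDir = Files.createTempDirectory("decode").toFile
    val dst = new Path(new File(localDir, src.getName).getAbsolutePath)
    fs.copyToLocalFile(src, dst)
    decode(dst.toString) // your existing decoder: local path => Array[MyCustomType]
  }

The drawback is the extra copy to local disk; the accepted approach below avoids it by streaming the bytes directly.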


1 Answer


The solution is to use Spark's PortableDataStream to stream the binary data that is on HDFS to the compute nodes.

import java.io.BufferedInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TIOStreamTransport

val documentsRDD: RDD[Document] = sparkContext.binaryFiles("/data/path/on/hdfs")
  .flatMap { case (path: String, pds: PortableDataStream) =>
    // pds.open() gives an InputStream over the file's bytes; here the files are gzipped
    val stream = new BufferedInputStream(new GZIPInputStream(pds.open()), 2048)
    // you can take it from here and do the rest. in my case I was dealing with thrift:
    val protocol: TBinaryProtocol = new TBinaryProtocol(new TIOStreamTransport(stream))
    decodeDocuments(protocol) // hypothetical helper, sketched below; returns the decoded records
  }
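The answer leaves the record loop to the reader; for completeness, here is a hypothetical decodeDocuments that the flatMap above could call. It assumes Document is a Thrift-generated class; the helper name and the EOF handling are my additions, not part of the original answer.

import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TTransportException

// Hypothetical helper: read Thrift-encoded Document structs back to back
// until the transport runs out of bytes.
def decodeDocuments(protocol: TBinaryProtocol): Iterator[Document] =
  Iterator
    .continually {
      try {
        val doc = new Document()
        doc.read(protocol) // Thrift-generated TBase.read(TProtocol)
        Some(doc)
      } catch {
        case _: TTransportException => None // thrown at end of stream
      }
    }
    .takeWhile(_.isDefined)
    .map(_.get)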