
Scenario:

  • There are 1000s of massive binary files on HDFS

  • There is a def decode(localFilePath: String): Array[MyCustomType] which can decode a file, given its local path, into its array of records

How can I use Spark with Scala to load these files in parallel and get an RDD[MyCustomType] in return?

P.S. decode is a Thrift decoder which takes a local file name and loads the Thrift file into memory as an array of records.

I think the missing piece here is downloading the file from HDFS to a node and passing the local name to decode.
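One way to do exactly that, as a minimal sketch (the path list, temp-dir handling, and the assumption that decode is serializable and available on the executors are all illustrative, not from the post): parallelize the list of HDFS paths, and in each task copy the file to the executor's local disk with Hadoop's FileSystem API before calling decode.

import java.io.File
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

// Sketch: ship the HDFS paths to the executors, copy each file to local
// disk, and pass the local path to the existing decode function.
val paths: Seq[String] = Seq("hdfs:///data/file-00001.bin") // your 1000s of files
val recordsRDD: RDD[MyCustomType] = sparkContext
  .parallelize(paths, numSlices = paths.size) // one file per partition
  .flatMap { hdfsPath =>
    val src = new Path(hdfsPath)
    val fs = src.getFileSystem(new Configuration()) // picks up the cluster config on the executor
    val localDir = Files.createTempDirectory("decode").toFile
    val dst = new Path(new File(localDir, src.getName).getAbsolutePath)
    fs.copyToLocalFile(src, dst)
    decode(dst.toString) // your existing decoder: local path => Array[MyCustomType]
  }

The drawback is the extra copy to local disk; the accepted approach below avoids it by streaming the bytes directly.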


1 Answer


The solution is to use Spark's PortableDataStream to stream the binary data that is on HDFS to the compute nodes.

import java.io.BufferedInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TIOStreamTransport

val documentsRDD: RDD[Document] = sparkContext.binaryFiles("/data/path/on/hdfs")
  .flatMap { case (path: String, pds: PortableDataStream) =>
    // pds.open() gives an InputStream over the file's bytes; here the files are gzipped
    val stream = new BufferedInputStream(new GZIPInputStream(pds.open()), 2048)
    // you can take it from here and do the rest. in my case I was dealing with thrift:
    val protocol: TBinaryProtocol = new TBinaryProtocol(new TIOStreamTransport(stream))
    decodeDocuments(protocol) // hypothetical helper, sketched below; returns the decoded records
  }
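The answer leaves the record loop to the reader; for completeness, here is a hypothetical decodeDocuments that the flatMap above could call. It assumes Document is a Thrift-generated class; the helper name and the EOF handling are my additions, not part of the original answer.

import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TTransportException

// Hypothetical helper: read Thrift-encoded Document structs back to back
// until the transport runs out of bytes.
def decodeDocuments(protocol: TBinaryProtocol): Iterator[Document] =
  Iterator
    .continually {
      try {
        val doc = new Document()
        doc.read(protocol) // Thrift-generated TBase.read(TProtocol)
        Some(doc)
      } catch {
        case _: TTransportException => None // thrown at end of stream
      }
    }
    .takeWhile(_.isDefined)
    .map(_.get)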