Scenario:
There are thousands of massive binary files on HDFS, and there is a decoder

def decode(localFilePath: String): Array[MyCustomType]

which, given a file's local path, decodes it into an array of records.
How can I use Scala Spark to load these files in parallel and get an RDD[MyCustomType]
back?
P.S. decode is a Thrift decoder: given a local file name, it loads the Thrift file into memory as an array of records.
I think the missing piece here is downloading a file from HDFS to a node and passing the local name to decode.
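A minimal sketch of that idea: list the HDFS paths on the driver, parallelize the paths (not the file contents), and have each task copy its file to executor-local disk before calling decode. The input directory `/data/binary-files`, the `MyCustomType` stub, and the `decode` body are placeholders standing in for the real ones; the Hadoop `FileSystem` calls are the standard API.

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

case class MyCustomType()                                  // placeholder record type
def decode(localFilePath: String): Array[MyCustomType] = ??? // the real Thrift decoder goes here

val sc = new SparkContext(new SparkConf().setAppName("decode-binary-files"))

// 1. On the driver, list the HDFS file paths (metadata only, cheap).
val driverFs = FileSystem.get(new Configuration())
val paths: Seq[String] =
  driverFs.listStatus(new Path("/data/binary-files")).map(_.getPath.toString)

// 2. Parallelize the *paths* so each task handles one file.
//    Inside the task: copy the file to executor-local disk, decode, clean up.
val records = sc.parallelize(paths, paths.length).flatMap { p =>
  // Configuration is not serializable, so recreate it per task.
  val fs = FileSystem.get(new java.net.URI(p), new Configuration())
  val local = Files.createTempFile("decode-", ".bin")
  Files.delete(local) // copyToLocalFile wants to create the destination itself
  // last arg 'true' uses the raw local FS, avoiding .crc checksum files
  fs.copyToLocalFile(false, new Path(p), new Path(local.toString), true)
  try decode(local.toString)
  finally Files.deleteIfExists(local)
}
// records: RDD[MyCustomType]
```

One partition per path keeps each massive file in its own task; if the files vary wildly in size you may want to repartition or batch small ones together.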