It seems like overkill to use Spark directly here ... if this data is going to be 'collected' to the driver anyway, why not use the HDFS API instead? Hadoop is usually bundled with Spark. Here is an example:
import java.net.URI
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._

val fileSpec = "/data/Invoices/20171123/21"
val conf = new Configuration()
// connect to the cluster's namenode (replace the placeholder with your namenode URI)
val fs = FileSystem.get(new URI("hdfs://nameNodeEnteredHere"), conf)
val path = new Path(fileSpec)
// optionally guard against missing or non-directory paths:
// if (fs.exists(path) && fs.isDirectory(path)) ...
val fileList = fs.listStatus(path)
Then println(fileList(0)) will print the first item as an org.apache.hadoop.fs.FileStatus, formatted like this:
FileStatus {
path=hdfs://nameNodeEnteredHere/Invoices-0001.avro;
isDirectory=false;
length=29665563;
replication=3;
blocksize=134217728;
modification_time=1511810355666;
access_time=1511838291440;
owner=codeaperature;
group=supergroup;
permission=rw-r--r--;
isSymlink=false
}
Here fileList(0).getPath will give hdfs://nameNodeEnteredHere/Invoices-0001.avro.
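To collect the paths of every file in the directory rather than just the first one, the same FileStatus API can be mapped over. A minimal sketch, continuing from the fileList value above (the size total via getLen is just for illustration):

// keep regular files only, then pull out their full HDFS paths
val filePaths = fileList.filter(_.isFile).map(_.getPath.toString)
filePaths.foreach(println)
// total bytes across the listed files, via FileStatus.getLen
val totalBytes = fileList.filter(_.isFile).map(_.getLen).sum
println(s"${filePaths.length} files, $totalBytes bytes")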
I suspect this way of listing files goes primarily through the HDFS namenode rather than through each executor. TL;DR: I'm betting Spark itself polls the namenode when it builds RDDs over these files, so if a simple listing is all that's needed, the approach above should be reasonably efficient. Still, comments pointing in either direction would be welcome.
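For what it's worth, if this code runs inside a Spark application anyway, the FileSystem handle can be obtained from Spark's own Hadoop configuration instead of hard-coding the namenode URI. A minimal sketch, assuming an existing SparkSession named spark:

import org.apache.hadoop.fs.{FileSystem, Path}

// reuse the Hadoop configuration Spark already carries, so the default
// filesystem (i.e. the namenode) is picked up without hard-coding a URI
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs2 = FileSystem.get(hadoopConf)
val statuses = fs2.listStatus(new Path("/data/Invoices/20171123/21"))
statuses.map(_.getPath.toString).foreach(println)

The listing still happens on the driver, talking to the namenode, which is consistent with the point above.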