
I have some text files on the master server that I want a Spark cluster to process for some statistics.

For example, I have 1.txt, 2.txt, and 3.txt on the master server in a specific directory such as /data/. I want the Spark cluster to process all of them exactly once. If I use sc.textFile("/data/*.txt") to load the files, the other nodes in the cluster cannot find them on their local file systems. However, if I use sc.addFile and SparkFiles.get to retrieve them on each node, all three text files are downloaded to every node and each of them ends up being processed multiple times.
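A minimal sketch of the two approaches described above, assuming an already-created SparkContext named sc (the file names and the /data/ path come from the question; the variable names are placeholders):

```scala
import org.apache.spark.SparkFiles

// Approach 1: read the directory directly.
// This only works if /data/*.txt exists at the same path on every
// worker; otherwise executors on other nodes fail to find the files.
val allLines = sc.textFile("file:///data/*.txt")

// Approach 2: ship each file to every node with the job.
// sc.addFile copies the files into every executor's working directory,
// so per-node logic that processes "the local copies" ends up handling
// the same three files once per node, i.e. multiple times.
sc.addFile("/data/1.txt")
sc.addFile("/data/2.txt")
sc.addFile("/data/3.txt")
val localCopyOfFirstFile = SparkFiles.get("1.txt") // per-node local path
```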

How can I solve this without HDFS? Thanks.

Junfeng
  • You have to mount the files on all nodes at the same path. That way, each partition will load unique files. – rakesh Mar 01 '17 at 08:35
  • So the same file will not be loaded twice? What if the files on each node are different? Which component synchronizes local file reading? – Junfeng Mar 02 '17 at 01:06
  • Related : https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files – duplex143 Jul 22 '22 at 08:43

1 Answer


According to the official documentation, just copy the files to all nodes.

http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
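A minimal sketch of that recommendation, assuming the three files have been copied to (or NFS-mounted at) the same /data/ path on every node; the object name and app name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalFilesExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-files-example")
    val sc = new SparkContext(conf)

    // Assumes 1.txt, 2.txt and 3.txt are available under /data/ at the
    // same absolute path on every node, e.g. copied with rsync or
    // served from a shared network mount.
    // The explicit file:// scheme forces the local filesystem even if
    // the cluster has a default distributed filesystem configured.
    val lines = sc.textFile("file:///data/*.txt")

    // Each file contributes its own partitions, so every line is read
    // exactly once across the cluster.
    println(s"total lines: ${lines.count()}")

    sc.stop()
  }
}
```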

Junfeng