
I have some text files on the master server that I want a Spark cluster to process for some statistics.

For example, I have 1.txt, 2.txt, and 3.txt on the master server in a specific directory such as /data/. I want the Spark cluster to process all of them exactly once. If I use sc.textFile("/data/*.txt") to load the files, the other nodes in the cluster cannot find them on their local file systems. However, if I use sc.addFile and SparkFiles.get to retrieve them on each node, all three text files are downloaded to every node and each of them ends up being processed multiple times.
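A minimal sketch of the two approaches described above, assuming an already-created SparkContext named sc (the file names and the /data/ path come from the question; the variable names are placeholders):

```scala
import org.apache.spark.SparkFiles

// Approach 1: read the directory directly.
// This only works if /data/*.txt exists at the same path on every
// worker; otherwise executors on other nodes fail to find the files.
val allLines = sc.textFile("file:///data/*.txt")

// Approach 2: ship each file to every node with the job.
// sc.addFile copies the files into every executor's working directory,
// so per-node logic that processes "the local copies" ends up handling
// the same three files once per node, i.e. multiple times.
sc.addFile("/data/1.txt")
sc.addFile("/data/2.txt")
sc.addFile("/data/3.txt")
val localCopyOfFirstFile = SparkFiles.get("1.txt") // per-node local path
```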

How can I solve this without HDFS? Thanks.

Junfeng
  • You have to mount the files on all nodes at the same path. That way, each partition will load unique files. – rakesh Mar 01 '17 at 08:35
  • So the same file will not be loaded twice? What if the files on each node are different? Which component synchronizes local file reading? – Junfeng Mar 02 '17 at 01:06
  • Related : https://stackoverflow.com/questions/38879478/sparkcontext-addfile-vs-spark-submit-files – duplex143 Jul 22 '22 at 08:43

1 Answer


According to the official documentation, just copy the files to all nodes.

http://spark.apache.org/docs/1.2.1/programming-guide.html#external-datasets

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
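A minimal sketch of that recommendation, assuming the three files have been copied to (or NFS-mounted at) the same /data/ path on every node; the object name and app name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalFilesExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-files-example")
    val sc = new SparkContext(conf)

    // Assumes 1.txt, 2.txt and 3.txt are available under /data/ at the
    // same absolute path on every node, e.g. copied with rsync or
    // served from a shared network mount.
    // The explicit file:// scheme forces the local filesystem even if
    // the cluster has a default distributed filesystem configured.
    val lines = sc.textFile("file:///data/*.txt")

    // Each file contributes its own partitions, so every line is read
    // exactly once across the cluster.
    println(s"total lines: ${lines.count()}")

    sc.stop()
  }
}
```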

Junfeng