In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way of doing so in Spark? I know about broadcast variables, but those are only good for variables, not files.
-
Why can't you load the file into a list or map and then broadcast it? – banjara Jun 25 '15 at 07:13
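In case it helps, here is a rough Scala sketch of that idea. It assumes a driver-local, tab-separated file, an existing SparkContext named sc, and an RDD of keys named someRdd; all of those names and the path are placeholders, not anything from the question itself.

    // Load the file into an in-memory map on the driver (path/format are assumptions).
    val entries = scala.io.Source.fromFile("/local/path/lookup.txt")
      .getLines()
      .map { line => val Array(k, v) = line.split("\t"); (k, v) }
      .toMap

    // Broadcast the map once; every executor receives a read-only copy.
    val lookup = sc.broadcast(entries)

    // Use the broadcast value inside tasks instead of re-reading the file.
    val joined = someRdd.map(key => (key, lookup.value.getOrElse(key, "missing")))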
2 Answers
Take a look at SparkContext.addFile()
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.
A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
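To illustrate the pattern, here is a minimal Scala sketch: addFile ships the file with the job, and SparkFiles.get resolves the local copy inside a task. The HDFS path, file name, and app name are placeholders I made up for the example.

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    val sc = new SparkContext(new SparkConf().setAppName("addFile-example"))

    // Ship the file to every node that runs tasks for this job.
    sc.addFile("hdfs:///data/lookup.txt")

    val rdd = sc.parallelize(1 to 10)
    val result = rdd.mapPartitions { iter =>
      // Resolve the local download location on whichever executor runs this partition.
      val localPath = SparkFiles.get("lookup.txt")
      val lines = scala.io.Source.fromFile(localPath).getLines().toSet
      iter.filter(i => lines.contains(i.toString))
    }
    result.collect()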

-
Not 100% sure, but I believe it would (can't test it right now). Spark falls back to Hadoop's Path for non-HTTP/FTP URIs: https://github.com/apache/spark/blob/43f50decdd20fafc55913c56ffa30f56040090e4/core/src/main/scala/org/apache/spark/SparkContext.scala#L1325 and I think it does handle S3 URIs. – Piotr Rudnicki Jun 25 '15 at 07:22
If your files are text files living in HDFS, you can use SparkContext's
textFile("<hdfs-path>")
This call gives you an RDD, which you can persist across nodes using the RDD's persist() method.
That method can keep the file data (serialized or deserialized) in memory and/or on disk.
Refer to:
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
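A short Scala sketch of that approach, assuming an existing SparkContext named sc and a placeholder HDFS path:

    import org.apache.spark.storage.StorageLevel

    // Read the text file into an RDD (the path is a placeholder).
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Keep the partitions in memory, spilling to disk if they don't fit.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    lines.count()                      // first action materializes and caches the partitions
    lines.filter(_.nonEmpty).count()   // later actions reuse the cached partitions

Note that, as the comment below points out, this distributes the data as partitions across the cluster rather than placing a full copy of the file on each node.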

-
This does NOT put the file on each node. It distributes it across nodes. – The Archetypal Paul Mar 12 '17 at 22:31
-
@TheArchetypalPaul: yes, you are right. I overlooked the requirement and suggested persist(), which just deals with individual partitions of the RDD instead of copying the whole file to each node in the cluster. – G.S.Tomar Apr 05 '17 at 11:20