6

In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way of doing this in Spark? I know about broadcast variables, but those are only good for variables, not files.

MetallicPriest

2 Answers

7

Take a look at SparkContext.addFile()

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
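A minimal sketch in Scala of how this could look; the file path, file name, and job setup are placeholder assumptions, not part of the original answer:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addFile-example"))

    // Ship a read-only file to every worker node; the path may be local,
    // on HDFS, or an HTTP/HTTPS/FTP URI. (Hypothetical path.)
    sc.addFile("hdfs:///data/lookup.txt")

    // Inside a job, SparkFiles.get resolves the local copy on each executor.
    val lineCounts = sc.parallelize(1 to 4).map { _ =>
      val localPath = SparkFiles.get("lookup.txt")
      scala.io.Source.fromFile(localPath).getLines().size
    }.collect()

    lineCounts.foreach(println)
    sc.stop()
  }
}
```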

Piotr Rudnicki
  • Would it also work with Amazon's S3? – MetallicPriest Jun 25 '15 at 07:08
  • Not 100% sure, but I believe it would (can't test it right now). Spark falls back to Hadoop's Path for non-HTTP/FTP URIs: https://github.com/apache/spark/blob/43f50decdd20fafc55913c56ffa30f56040090e4/core/src/main/scala/org/apache/spark/SparkContext.scala#L1325 and I think that does handle S3 URIs. – Piotr Rudnicki Jun 25 '15 at 07:22
-3

If your files are text files living in HDFS, you can use SparkContext.textFile("<hdfs-path>").

This call gives you an RDD, which you can persist across the cluster using the RDD's persist() method.

persist() can keep the file's data (serialized or deserialized) in memory and/or on disk.

Refer to:

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
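A minimal sketch of that approach, assuming an existing SparkContext named sc and a hypothetical HDFS path (note the comment below: this caches RDD partitions across the cluster rather than copying the whole file to every node):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc` and a hypothetical HDFS path.
val lines = sc.textFile("hdfs:///data/input.txt")

// Cache the RDD's partitions across the cluster (serialized),
// spilling to disk when they do not fit in memory.
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(lines.count())
```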

G.S.Tomar
  • This does NOT put the file on each node. It distributes it across nodes. – The Archetypal Paul Mar 12 '17 at 22:31
  • @TheArchetypalPaul: Yes, you are right. I overlooked the requirement and suggested persist(), which only deals with the individual partitions of the RDD instead of copying the whole file to each node in the cluster. – G.S.Tomar Apr 05 '17 at 11:20