6

In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way of doing this in Spark? I know about broadcast variables, but those are only good for variables, not files.

MetallicPriest

2 Answers

7

Take a look at SparkContext.addFile()

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
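A minimal sketch in Scala of how this could look; the file path, file name, and job setup are placeholder assumptions, not part of the original answer:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addFile-example"))

    // Ship a read-only file to every worker node; the path may be local,
    // on HDFS, or an HTTP/HTTPS/FTP URI. (Hypothetical path.)
    sc.addFile("hdfs:///data/lookup.txt")

    // Inside a job, SparkFiles.get resolves the local copy on each executor.
    val lineCounts = sc.parallelize(1 to 4).map { _ =>
      val localPath = SparkFiles.get("lookup.txt")
      scala.io.Source.fromFile(localPath).getLines().size
    }.collect()

    lineCounts.foreach(println)
    sc.stop()
  }
}
```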

Piotr Rudnicki
  • Would it also work with Amazon's S3? – MetallicPriest Jun 25 '15 at 07:08
  • Not 100% sure, but I believe it would (can't test it right now). Spark falls back to Hadoop's Path for non-HTTP/FTP URIs: https://github.com/apache/spark/blob/43f50decdd20fafc55913c56ffa30f56040090e4/core/src/main/scala/org/apache/spark/SparkContext.scala#L1325 and I think that does handle S3 URIs. – Piotr Rudnicki Jun 25 '15 at 07:22
-3

If your files are text files living in HDFS, you can use SparkContext.textFile("<hdfs-path>").

This call gives you an RDD, which you can persist across the cluster using the RDD's persist() method.

persist() can keep the file's data (serialized or deserialized) in memory and/or on disk.

Refer to:

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
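A minimal sketch of that approach, assuming an existing SparkContext named sc and a hypothetical HDFS path (note the comment below: this caches RDD partitions across the cluster rather than copying the whole file to every node):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc` and a hypothetical HDFS path.
val lines = sc.textFile("hdfs:///data/input.txt")

// Cache the RDD's partitions across the cluster (serialized),
// spilling to disk when they do not fit in memory.
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

println(lines.count())
```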

G.S.Tomar
  • This does NOT put the file on each node. It distributes it across nodes. – The Archetypal Paul Mar 12 '17 at 22:31
  • @TheArchetypalPaul: Yes, you are right. I overlooked the requirement and suggested persist(), which only deals with the individual partitions of the RDD instead of copying the whole file to each node in the cluster. – G.S.Tomar Apr 05 '17 at 11:20