
I am using Spark 1.6.0. I want to pass some properties files, such as log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that there is an addFile method on SparkContext. I would prefer to use --files instead of adding the files programmatically, assuming both options are the same?

I did not find much documentation about --files, so are --files and SparkContext.addFile the same?
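To be concrete, the two options I am comparing look like this (the class name, jar, and file paths are just placeholders):

    # via spark-submit
    spark-submit --class com.example.MyApp \
      --files /path/to/log4j.properties,/path/to/custom.properties \
      my-app.jar

    // via SparkContext, from inside the application (Scala)
    sc.addFile("/path/to/log4j.properties")
    sc.addFile("/path/to/custom.properties")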

References I found about --files and about SparkContext.addFile.

Abdullah Shaikh

1 Answer


It depends on whether your Spark application is running in client or cluster mode.

In client mode the driver (application master) runs locally and can access those files from your project, because they are available on the local file system. SparkContext.addFile should find your local files and work as expected.
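As an illustration, here is a minimal client-mode sketch in Scala (the app name, file name, and path are placeholders); SparkFiles.get resolves the local copy that Spark downloaded on each node:

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    val sc = new SparkContext(new SparkConf().setAppName("addfile-demo"))

    // In client mode the driver runs locally, so a driver-local path is visible here
    sc.addFile("/path/to/app.properties")

    // Each executor resolves the copy that Spark downloaded for it
    val contents = sc.parallelize(1 to 2).map { _ =>
      scala.io.Source.fromFile(SparkFiles.get("app.properties")).mkString
    }.collect()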

If your application is running in cluster mode, the whole application is transferred via spark-submit to the Spark master or to YARN, which starts the driver (application master) on a specific node within the cluster, in a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
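A cluster-mode submission could then look something like this (the master, class, jar, and file names are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --files log4j.properties,app.properties \
      --jars lib/some-dependency.jar \
      my-app.jar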

gclaussn
  • SparkContext.addFile's Javadoc says "Add a file to be downloaded with this Spark job on every node.", so it seems that, like --files, addFile also transfers the files to every node? – Abdullah Shaikh Aug 10 '16 at 18:00
  • Parts of your application run distributed (because Spark is a cluster computing framework), so the resources are likely needed on every node and will therefore be distributed to be accessible on every executor. – gclaussn Aug 10 '16 at 18:16
  • Dumb question: does that mean I can use either addFile or --files to transfer files to the cluster if I am running in cluster mode? And in client mode I just need to use addFile and not --files? – Abdullah Shaikh Aug 10 '16 at 18:23
  • There are no dumb questions! :-) If your files are available via http, hdfs, etc., you should be able to use addFile and --files in client as well as in cluster mode (see the sketch after these comments). In cluster mode, a local file that has not been added to the spark-submit call will not be found via addFile. This is because the driver (application master) is started on the cluster and is already running when it reaches the addFile call; it is too late at that point. The application has already been submitted, and the local file system is the file system of a specific cluster node. – gclaussn Aug 10 '16 at 18:38
  • @gclaussn So in client mode we can use addFile and --files (if the files are local), right? – goks Nov 09 '17 at 21:27
  • @gclaussn I am running YARN in cluster deploy mode, and I need the uploaded file at the time of DAG creation on the driver side. Will the --files approach work? – Geek Sep 17 '19 at 04:48
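As mentioned in the comments above, a file that is reachable from every node (e.g. on HDFS) can be added with addFile in either mode, since no driver-local file system is involved. A minimal sketch (the hdfs:// URI is a placeholder):

    // The URI resolves from any node, so this works in client and cluster mode
    sc.addFile("hdfs:///config/app.properties")

    sc.parallelize(1 to 1).map { _ =>
      org.apache.spark.SparkFiles.get("app.properties") // local path on the executor
    }.collect()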