27

`--archives`, `--files`, `--py-files`, `sc.addFile`, and `sc.addPyFile` are quite confusing. Can someone explain these clearly?

TayTay
JasonWayne
  • The last two are explicitly from the SparkContext object, and the first 3 are from the terminal (though I can't seem to find a reference to archives in the submitting applications documentation) – OneCricketeer Jun 28 '16 at 03:04

2 Answers

19

These options are truly scattered all over the place.

In general, add your data files via --files or --archives and code files via --py-files. The latter will be added to the search path (c.f. here) so you can import and use them.

As you might imagine, these CLI arguments are actually handled by the addFile and addPyFile functions (c.f. here).
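For concreteness, here is a minimal spark-submit invocation showing all three flags together (the file names are invented for illustration):

    spark-submit \
      --files config.json \
      --archives deps.tar.gz \
      --py-files mylib.zip \
      main.py

Here config.json and the contents of deps.tar.gz are shipped as data, while mylib.zip ends up on the import path so main.py can import from it.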

Behind the scenes, pyspark invokes the more general spark-submit script.

You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files

The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
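A hedged sketch of the alias mechanism described above (the file names are invented; on YARN the localized copy can be resolved via SparkFiles.get or opened by its alias from the container working directory):

    # main.py -- submitted (hypothetically) with:
    #   spark-submit --master yarn --deploy-mode cluster \
    #       --files localtest.txt#appSees.txt main.py
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="alias-demo")

    def count_chars(_):
        # Reference the file by its alias, not by its original name.
        with open(SparkFiles.get("appSees.txt")) as f:
            return len(f.read())

    print(sc.parallelize([0]).map(count_chars).collect())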

addFile(path) Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

addPyFile(path) Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
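Putting the two methods together, a minimal programmatic sketch (the paths and the helpers module are hypothetical):

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addfile-demo")

    # Ship a data file to every node; tasks locate it with SparkFiles.get.
    sc.addFile("hdfs:///data/lookup.csv")      # hypothetical path

    # Ship a code dependency; afterwards it is importable inside tasks.
    sc.addPyFile("/tmp/helpers.zip")           # hypothetical path

    def use_both(x):
        import helpers                         # provided by addPyFile (hypothetical module)
        with open(SparkFiles.get("lookup.csv")) as f:
            table = f.read()
        return helpers.lookup(table, x)        # hypothetical function

    print(sc.parallelize(range(4)).map(use_both).collect())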

Community
shuaiyuancn
  • Why is there no `addArchives(path)` function? How can I add archives from code? – xiaobing Jan 15 '19 at 03:48
  • Very nice answer, sir. The answer may be more complete with a SparkFiles.get reference (https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkFiles), since it is not very clear what its relation is to the other things you described, or what a proper use of it would be. – ciurlaro Apr 02 '21 at 10:39
0

This confused me earlier too. After experimenting with Spark 3.0 today, I finally understand how these options work.

We can divide these options into two categories.

The first category is data files: Spark simply adds the specified files into the containers; no further commands are executed. There are two options in this category (see the sketch after this list):

  • --archives: with this option you can submit archives, and Spark will extract the files in them for you; Spark supports zip, tar, and other formats.
  • --files: with this option you can submit files; Spark will put them into the container and do nothing else. sc.addFile is the programmatic API for this one.
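A hedged sketch of the data-file category (file names are invented; on YARN an alias after # becomes the directory the archive is extracted into):

    spark-submit \
      --archives nltk_data.tar.gz#nltk_data \
      --files lookup.csv \
      main.py
    # In the container, the archive contents appear under ./nltk_data,
    # while lookup.csv is shipped as-is. The programmatic equivalent of
    # --files here would be sc.addFile("lookup.csv").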

The second category is code dependencies. In a Spark application, a code dependency can be a JVM dependency or, for PySpark applications, a Python dependency.

  • --jars: this option is used to submit JVM dependencies as jar files; Spark will add these jars to the CLASSPATH automatically, so your JVM can load them.

  • --py-files: this option is used to submit Python dependencies, which can be .py, .egg, or .zip files. Spark will add these files to the PYTHONPATH so your Python interpreter can find them (see the sketch after this list).

    sc.addPyFile is the programmatic API for this one.

    PS: for a single .py file, Spark adds it to a __pyfiles__ folder; everything else is added to the CWD.
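A hedged sketch of the code-dependency category (the artifact names are invented):

    # main.py -- submitted (hypothetically) with:
    #   spark-submit --jars postgresql-42.2.5.jar --py-files mypackage.zip main.py
    # The jar ends up on the JVM CLASSPATH; the zip ends up on the PYTHONPATH.
    from pyspark import SparkContext
    import mypackage                           # importable thanks to --py-files (hypothetical package)

    sc = SparkContext(appName="deps-demo")
    print(sc.parallelize([1, 2, 3]).map(mypackage.transform).collect())  # hypothetical function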


All four of these options can take multiple files, separated by commas, and for each file you can specify an alias using the {URL}#{ALIAS} format. Don't specify an alias with the --py-files option, because Spark won't add the alias to the PYTHONPATH.

By the way, every file supports multiple URI schemes; if you don't specify one, the default is file: (a combined example follows the list below).

  • file: the driver transfers these files to the executors through HTTP; in cluster deploy mode, Spark will first upload these files to the cluster driver.
  • hdfs:, http:, https:, ftp: the driver and executors will download the specified files from the corresponding filesystem.
  • local: the file is expected to exist as a local file on each worker node.
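A combined example with the different schemes (all paths are invented):

    spark-submit \
      --files file:///home/me/app.conf,hdfs:///configs/cluster.conf \
      --py-files local:///opt/libs/shared_helpers.zip \
      main.py
    # file:  -> served by the driver over HTTP (uploaded first in cluster deploy mode)
    # hdfs:  -> driver and executors download it from HDFS
    # local: -> assumed to already exist at that path on every worker node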

reference

alex li