5

I know that one can send files through spark-submit with the --files option, but is it also possible to send a whole folder?

Actually I want to send the lib folder, containing jar files of external libraries. Or does the --jars option already create a lib folder in the executor's working directory? In my case it is necessary that there is a lib folder, otherwise an error is raised.

pythonic

3 Answers

2

No, the spark-submit --files option doesn't support sending a folder, but you can put all your files in a zip and pass that zip in the --files list. In your Spark job you can then use SparkFiles.get(filename) to locate the archive, extract it and use the extracted files. 'filename' doesn't need to be an absolute path; the plain file name is enough.

PS: This works only after the SparkContext has been initialized.
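
A minimal PySpark sketch of that approach; the archive name lib.zip and the lib extraction directory are assumptions for illustration, not taken from the answer:

# Submitted with something like:  spark-submit --files lib.zip my_job.py
from pyspark import SparkContext, SparkFiles
import zipfile

sc = SparkContext(appName="zip-files-example")  # SparkFiles.get works only after this

# 'lib.zip' is just the file name passed via --files; no absolute path needed
zip_path = SparkFiles.get("lib.zip")

# Extract the archive into a lib/ directory and use the extracted files
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("lib")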

Shahzad
  • I didn't understand your answer. So if I put the files in a zip file, say lib.zip, would spark-submit extract the files into a lib folder in the executor's directory? – pythonic Sep 01 '17 at 15:39
  • No, I was referring to passing a folder containing multiple conf files, and my answer was related to --files. Both --jars and --files copy files to the working directory of the executor, but jars given in --files would not be included in the classpath. In order to use jars in the classpath, they must be given in the --jars list. --jars doesn't support folder inclusion. – Shahzad Sep 01 '17 at 18:33
  • Provide conf files in --files and dependencies in the --jars list, and you should be good, as they will be copied to the working directory of the executor automatically. – Shahzad Sep 01 '17 at 18:39
0

You can do this:

  1. Archive the folder --> myfiles.zip
  2. Ship the archive using the "spark.yarn.dist.archives" conf:
    Example:
spark-submit \
...
--conf spark.yarn.dist.archives=myfiles.zip
...
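
On YARN this archive should be extracted into each executor's working directory, and (as with Hadoop's distributed cache) a #alias fragment such as myfiles.zip#lib can be used to name the extracted directory. A minimal PySpark sketch under that assumption; all names are illustrative:

# Assumes submission on YARN with --conf spark.yarn.dist.archives=myfiles.zip#lib,
# so the archive should be unpacked and linked as ./lib in the executor's
# working directory (without the #lib alias the link is named after the archive).
import os
from pyspark import SparkContext

sc = SparkContext(appName="dist-archives-example")

def contents_of_lib(_):
    # Runs on an executor; lists the extracted archive in its working directory
    return sorted(os.listdir("lib"))

print(sc.parallelize([0], numSlices=1).map(contents_of_lib).collect())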
-1

I think there are multiple ways to do this.

First, I understand that you want to automate this, but if you don't have many jars you can just pass them one by one to the --jars option.

Otherwise you can just sudo mv all your jars into the spark/jars directory of your Spark installation, but that is annoying on a cluster.

So finally, you can script it: a small bash (or Python) snippet that lists all the jars in your lib folder and joins them into the comma-separated list that --jars expects (see the sketch below).
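
A minimal sketch of that idea in Python (the asker already has a Python script); the lib/ path and my_app.py are placeholders, not from the answer:

import glob
import subprocess

# Build a comma-separated list of every jar found in the local lib/ folder
jars = ",".join(sorted(glob.glob("lib/*.jar")))

# Launch spark-submit with that list; my_app.py stands in for the real application
subprocess.run(["spark-submit", "--jars", jars, "my_app.py"], check=True)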

This doesn't solve the problem if you need cluster mode. For cluster mode, I would just modify the script to query an HDFS directory containing your jars, and put all your jars in that HDFS directory.

Maybe there are other solutions, but those were my thoughts.

Have a good weekend!

tricky
  • Well, I have a Python script that can read all the jar files anyway, but the question is how to get them into a lib folder in the executor's directory. Does spark-submit by default put all files specified by --jars into a lib folder, or do I have to create one myself? – pythonic Sep 01 '17 at 15:37