
I have an Apache Spark cluster (1 master + 1 worker) running on Docker. I'm able to submit a job with spark-submit that fits a pipeline and then saves it (PipelineModel.save(path)). The file is saved on my local machine, exactly where I executed the spark-submit command.

The problem arises when I try to deploy production code that loads the PipelineModel and uses it for prediction: I'm not able to pass the folder containing the saved files.

This is the code I use to submit the job:

spark-submit --class ch.supsi.isteps.Main --master spark://172.17.0.1:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --files=test/aFolder ./STR-0.1-alpha.jar --mode=production --file=test/aFolder

where --mode=production and --file=test/aFolder are parameters to my program.

I already tried --files, but it doesn't accept folders. I would like to avoid copying the model to all the worker nodes.

EDIT

The problem is related to HDFS and Docker. As a backup solution we stopped running a Spark cluster inside Docker and switched to local mode inside Docker. This allowed us to save and retrieve the files without problems. If you map the folders (docker-compose -> volumes) you don't even need to pass the files, as they are already mapped to your containers.

1 Answer


I already tried to use --files, but it doesn't accept folders

Option 1:

SparkContext has the method below to add files; you can loop over the files in your folder and add each of them (a sketch follows the method).

/**
 * Add a file to be downloaded with this Spark job on every node.
 *
 * If a file is added during execution, it will not be available until the next TaskSet starts.
 *
 * @param path can be either a local file, a file in HDFS (or other Hadoop-supported
 * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
 * use `SparkFiles.get(fileName)` to find its download location.
 */
def addFile(path: String): Unit = {
  addFile(path, false)
}
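For example, a minimal sketch of that loop, assuming the model folder from the question (test/aFolder). Note that PipelineModel.save writes nested subfolders, so a real implementation would need to walk the tree recursively:

import java.io.File
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LoadModel").getOrCreate()
val sc = spark.sparkContext

// Add every regular file directly under test/aFolder to the job.
// Subfolders are skipped here; recurse over them if your saved model has any.
new File("test/aFolder")
  .listFiles()
  .filter(_.isFile)
  .foreach(file => sc.addFile(file.getAbsolutePath))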

As mentioned above, with SparkFiles.get(fileName) you can get the local path of an added file by its name.
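A sketch of the lookup side (the file name here is just a placeholder for one of the names you passed to addFile):

import org.apache.spark.SparkFiles

// "someFile" is a placeholder; use the actual file name you added.
val localPath: String = SparkFiles.get("someFile")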

Or else, SparkFiles has getRootDirectory to get the folder where the added files are downloaded, and you can access them from there.

/**
 * Get the root directory that contains files added through `SparkContext.addFile()`.
 */
def getRootDirectory(): String =
  SparkEnv.get.driverTmpDir.getOrElse(".")
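For example, a small sketch that inspects that directory on a node:

import java.io.File
import org.apache.spark.SparkFiles

// Every file registered through sc.addFile ends up under this directory.
val root = new File(SparkFiles.getRootDirectory())
root.listFiles().foreach(f => println(f.getAbsolutePath))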

Alternatively, using sparkContext.listFiles you can get the list of added files as a sequence.
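For example:

// Prints the URIs of all files that have been added with sc.addFile.
sc.listFiles().foreach(println)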

Option 2: If you want to go ahead with the --files option, then you can follow my answer on submitting multiple jars from a folder; using the same approach you can pass multiple files from a folder as a comma-separated list as well.
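For instance, a sketch of the same submit command with the files listed explicitly (file1 and file2 are placeholders for the actual files inside test/aFolder):

spark-submit --class ch.supsi.isteps.Main --master spark://172.17.0.1:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --files test/aFolder/file1,test/aFolder/file2 ./STR-0.1-alpha.jar --mode=production --file=test/aFolder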

Hope this helps!

Ram Ghadiyaram