I have an Apache Spark cluster (1 master + 1 worker) running on Docker. I'm able to submit a job with spark-submit
that fits a pipeline and then saves it (PipelineModel.save(path)).
The model is saved on my local machine, in the directory from which I executed the spark-submit command.
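
Roughly, the training job does something like this (the input data, column names and stages below are simplified placeholders, not my real pipeline):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fit-and-save").getOrCreate()

// placeholder training data
val training = spark.read.parquet("test/training.parquet")

// placeholder stages: assemble features, then fit a classifier
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// this writes a directory (metadata/ + stages/), not a single file
model.save("test/aFolder")
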
The problem arises when I deploy the production code that loads the PipelineModel and uses it for prediction:
I'm not able to pass the folder containing the saved files to the job.
This is the command I use to submit the job:
spark-submit --class ch.supsi.isteps.Main \
  --master spark://172.17.0.1:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
  --files=test/aFolder \
  ./STR-0.1-alpha.jar --mode=production --file=test/aFolder
where --mode=production --file=test/aFolder are parameters to my program.
I already tried to use --files, but it doesn't accept folders. I would like to avoid copying the model to all the worker nodes by hand.
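
On the production side I essentially want to do the following (the input DataFrame is a placeholder; the point is that PipelineModel.load(path) needs the path to be reachable from the cluster):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("predict").getOrCreate()

// placeholder input; in production the data comes from Kafka
val newData = spark.read.parquet("test/newData.parquet")

// the path must be visible to the driver and the executors
// (shared filesystem, HDFS, S3, ...), which is exactly my problem
val model = PipelineModel.load("test/aFolder")
val predictions = model.transform(newData)
predictions.show()
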
EDIT
The problem is related to HDFS and Docker. As a fallback we stopped running a Spark cluster inside Docker and switched to local mode inside Docker. This allowed us to save and retrieve the files without problems. If you map the folders (docker-compose -> volumes), you don't even need to pass the files, as they are already mapped into your containers.
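
For example, a mapping along these lines in docker-compose.yml (image name and container path are just illustrative) makes the saved model visible both on the host and inside the container, so the job can load it directly from the mapped path:

version: "3"
services:
  spark:
    image: my-spark-image            # placeholder image
    volumes:
      # host folder with the saved PipelineModel -> path inside the container
      - ./test/aFolder:/opt/app/test/aFolder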