
My goal is to save a Spark model and then zip it, but I am running into problems because os.path.exists(path) does not find the model that was just created. This is the code:

...
model.write().save(model_location)
model2 = PipelineModel.load(model_location)  # the model is loaded
print(os.path.exists(model_location))  # prints False
shutil.make_archive(model_location, 'zip', model_location)  # this fails, file not found

I think the cause of the problem is that os.path.exists() is somehow not seeing the directory, but I still don't know how to fix it. The model is clearly created: I can load it into model2 right after saving, and once the run ends the folder with the model is there. Waiting until the folder appears does not help either.

Or maybe it is a Spark configuration problem. I am running this on an Ambari cluster, and the code works on my local machine but not there, so I am not sure what the problem is.


1 Answer


Your mistake is to assume that the model will be saved to a local, POSIX-compliant file system.

ML models are saved using standard Spark SQL utilities, so the save path is resolved against Spark's default file system, which on a cluster normally points to a distributed file system such as HDFS. That is why os.path.exists(), which only checks the driver's local file system, returns False even though the model was written successfully.
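You can confirm where the path actually ended up by asking the Hadoop FileSystem API through the JVM gateway. This is only a sketch, not part of the original answer: it assumes an active SparkSession named spark and the same model_location string from your code, and it uses the SparkContext's private _jsc/_jvm accessors.

import os

# Resolve the default file system from the Hadoop configuration Spark is using
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
hdfs_path = jvm.org.apache.hadoop.fs.Path(model_location)

print(fs.exists(hdfs_path))            # True if the model directory is on the default FS (e.g. HDFS)
print(os.path.exists(model_location))  # False, because this only looks at the local file system

On your local machine both checks agree because the default file system there is the local one, which is why the same code appears to work locally but not on the cluster.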

Most likely you'll have to copy the model (which is stored as Parquet files) to the local file system and work with it from there, although from the overall description you probably want one of the methods described in How to serve a Spark MLlib model?
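If you do need the zip on the driver, one option is to pull the directory down with the hdfs CLI first and archive the local copy. Again just a sketch, assuming the hdfs command is on the PATH and using a hypothetical local destination /tmp/model_copy:

import os
import shutil
import subprocess

local_copy = "/tmp/model_copy"   # hypothetical local destination on the driver

# Copy the saved model directory from the default (distributed) file system to local disk
subprocess.check_call(["hdfs", "dfs", "-copyToLocal", model_location, local_copy])

print(os.path.exists(local_copy))             # True now: the copy is on the local file system
shutil.make_archive(local_copy, "zip", local_copy)   # creates /tmp/model_copy.zip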