I have built a preliminary ML (PySpark) model with sample data on my PC (Windows), and the accuracy is around 70%. After persisting the model binary on disk, I read it from a different Jupyter notebook and the accuracy is again somewhere near 70%. Now if I do the same thing on our cluster (MapR/Unix), after reading the model binary from disk, the accuracy drops to 10-11% (the dataset is exactly the same). I got the same issue with the full dataset as well (just for information).
Since the cluster runs a Unix OS, I also tried training, persisting, and testing the model in a Docker container (Unix), but there was no issue there. The problem occurs only on the cluster.
I have been scratching my head ever since about what might be causing this and how to resolve it. Please help.
Edit:
It's a classification problem and I have used pyspark.ml.classification.RandomForestClassifier.
To persist the models I am simply using the standard setup:
model.write().overwrite().save(model_path)
And to load the model:
model = pyspark.ml.classification.RandomForestClassificationModel.load(model_path)
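After loading, the accuracy is checked on the same test data in roughly this way; a simplified sketch (the evaluator, column names, and test_df are illustrative placeholders, not my exact code):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# score the reloaded model on the same test set
predictions = model.transform(test_df)

# compute accuracy to compare against the training notebook
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print(evaluator.evaluate(predictions))  # ~70% locally, ~10-11% on the cluster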
I have used StringIndexer, OneHotEncoder, etc. in the model and have also persisted them on disk in order to use them in the other Jupyter notebook (the same way as the main model).
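To make that concrete, the fitted transformer stages are saved and reloaded the same way as the classifier; a simplified sketch (paths and variable names are illustrative):

from pyspark.ml.feature import StringIndexerModel, OneHotEncoder

# training notebook: persist the fitted stages next to the classifier
indexer_model.write().overwrite().save(indexer_path)
encoder.write().overwrite().save(encoder_path)

# other notebook: reload the stages and apply them before the classifier
indexer_model = StringIndexerModel.load(indexer_path)
encoder = OneHotEncoder.load(encoder_path)
features_df = encoder.transform(indexer_model.transform(raw_df))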
Edit:
Python: 3.x
Spark: 2.3.1