
I have built a preliminary ML (PySpark) model with sample data on my PC (Windows), and its accuracy is around 70%. After persisting the model binary to disk, I read it back from a different Jupyter notebook and the accuracy is again around 70%. However, when I do the same thing on our cluster (MapR/Unix), the accuracy after reading the model binary from disk drops to 10-11%, even though the dataset is exactly the same. I get the same issue with the full dataset as well (just for information).

Since the cluster runs Unix, I also tried training, persisting, and testing the model in a Docker container (Unix), but there was no issue there. The problem occurs only on the cluster.

I have been scratching my head ever since about what might be causing this and how to resolve it. Please help.

Edit:

It's a classification problem and I have used pyspark.ml.classification.RandomForestClassifier.

To persist the model I am simply using the standard approach:

model.write().overwrite().save(model_path)

And to load the model:

from pyspark.ml.classification import RandomForestClassificationModel
model = RandomForestClassificationModel.load(model_path)

I have used StringIndexer, OneHotEncoder, etc. in the model pipeline and have also persisted them to disk in order to use them in the other Jupyter notebook (the same way as the main model).
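For context on why persisting (rather than re-fitting) the StringIndexer matters: by default it assigns indices in descending label-frequency order, so an indexer fitted on data with different label frequencies produces a different label-to-index mapping, which would make a loaded model's predictions score near-randomly. A plain-Python sketch of the effect (the helper `frequency_desc_index` is hypothetical, not part of pyspark; tie-breaking here is alphabetical for determinism and may differ from Spark's):

```python
# Hypothetical illustration (no Spark required) of PySpark StringIndexer's
# default "frequencyDesc" ordering, and how re-fitting it on data with
# different label frequencies shuffles the label -> index mapping.
from collections import Counter

def frequency_desc_index(labels):
    """Mimic a frequency-descending index assignment: the most frequent
    label gets index 0. Ties are broken alphabetically for determinism."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

# Mapping fitted on the sample data (e.g. on the PC):
sample_mapping = frequency_desc_index(["cat", "cat", "dog", "fish"])
# Mapping re-fitted on data with different frequencies (e.g. on the cluster):
cluster_mapping = frequency_desc_index(["fish", "fish", "dog", "cat"])

print(sample_mapping)   # {'cat': 0, 'dog': 1, 'fish': 2}
print(cluster_mapping)  # {'fish': 0, 'cat': 1, 'dog': 2}
```

If both notebooks load the same persisted indexer, the mapping is identical and this failure mode is ruled out; a large accuracy drop like the one described is consistent with a stage being re-fit somewhere in the cluster workflow.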

Edit:

Python: 3.x
Spark: 2.3.1

Mrinal
  • Sounds very weird. Could we get some more information about the model? – pissall Oct 08 '19 at 03:57
  • Sure. I will edit my question to add more info. Please let me know if you need anything else. – Mrinal Oct 08 '19 at 04:07
  • The serialization strategy that you used to persist the model, the kind of model, etc. – pissall Oct 08 '19 at 04:09
  • did you fix the random_state ? Here is an explanation why this can have an impact: https://stackoverflow.com/questions/39158003/confused-about-random-state-in-decision-tree-of-scikit-learn – PV8 Oct 08 '19 at 07:56
  • No I didn't. And I am using the same code (end to end) everywhere. So shouldn't matter even if I did. – Mrinal Oct 08 '19 at 08:28

0 Answers