
I am trying to serialize a PySpark ML model to MLeap. However, the model uses SQLTransformer for some column-based transformations, e.g. adding log-scaled versions of some columns. MLeap doesn't support SQLTransformer (see https://github.com/combust/mleap/issues/126), so I've implemented the first of the two suggestions from that issue:

  • For non-row operations, move the SQL out of the ML Pipeline that you plan to serialize
  • For row-based operations, use the available ML transformers or write a custom transformer <- this is where the custom transformer documentation will help.

I've externalized the SQL transformation: I apply it to the training data before building the model, and I apply the same transformation to the input data when I run the model for evaluation.
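
For concreteness, here is a minimal sketch of what I mean by externalizing the transformation (the column names amount/log_amount and the DataFrame names raw_train_df/raw_eval_df are made up):

    from pyspark.sql import functions as F

    def add_log_columns(df):
        # Stand-in for the SQLTransformer statement, e.g.
        # "SELECT *, LOG(amount + 1) AS log_amount FROM __THIS__"
        return df.withColumn("log_amount", F.log(F.col("amount") + 1))

    # Applied identically to training and evaluation data,
    # before either reaches the pipeline.
    train_df = add_log_columns(raw_train_df)
    eval_df = add_log_columns(raw_eval_df)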

The problem I'm having is that I'm unable to obtain the same results across the two models.

Model 1 - pure Spark ML model containing the SQLTransformer plus the later transformations:

 SQLTransformer -> StringIndexer -> OneHotEncoderEstimator ->
 VectorAssembler -> RandomForestClassifier

Model 2 - externalized version, with the SQL queries run on the training data before building the model. The pipeline contains everything after the SQLTransformer in Model 1:

 StringIndexer -> OneHotEncoderEstimator ->
 VectorAssembler -> RandomForestClassifier
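
To make the setup concrete, here is roughly how the two pipelines are built (column names and parameters are placeholders; train_df is the pre-transformed data from the sketch above):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (SQLTransformer, StringIndexer,
                                    OneHotEncoderEstimator, VectorAssembler)
    from pyspark.ml.classification import RandomForestClassifier

    indexer = StringIndexer(inputCol="category", outputCol="category_idx")
    encoder = OneHotEncoderEstimator(inputCols=["category_idx"],
                                     outputCols=["category_vec"])
    assembler = VectorAssembler(inputCols=["category_vec", "log_amount"],
                                outputCol="features")
    # Fixed seed so the two random forest fits are comparable.
    rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                                seed=42)

    sql = SQLTransformer(
        statement="SELECT *, LOG(amount + 1) AS log_amount FROM __THIS__")

    # Model 1: SQL inside the pipeline, fit on the raw data.
    model1 = Pipeline(stages=[sql, indexer, encoder, assembler, rf]) \
        .fit(raw_train_df)

    # Model 2: SQL pre-applied outside the pipeline.
    model2 = Pipeline(stages=[indexer, encoder, assembler, rf]) \
        .fit(train_df)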

I'm wondering how I could go about debugging this problem. Is there a way to compare the results after each stage to see where the differences first show up? Any suggestions are appreciated.
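
One approach I'm considering, sketched below with the hypothetical names from above: a fitted PipelineModel exposes its fitted stages, so the intermediate DataFrames can be materialized one stage at a time and diffed.

    def intermediate_outputs(pipeline_model, df):
        # Apply the fitted stages one at a time, keeping each result.
        outputs = []
        for stage in pipeline_model.stages:
            df = stage.transform(df)
            outputs.append(df)
        return outputs

    outs1 = intermediate_outputs(model1, raw_eval_df)  # stage 0 is the SQLTransformer
    outs2 = intermediate_outputs(model2, eval_df)      # SQL already applied

    # Model 1 has one extra leading stage, hence the offset of 1.
    for i, (df1, df2) in enumerate(zip(outs1[1:], outs2)):
        common = [c for c in df2.columns if c in df1.columns]
        # subtract needs comparable column types; vector-valued columns
        # may have to be compared element-wise instead.
        n = df1.select(common).subtract(df2.select(common)).count()
        print("stage %d: %d differing rows" % (i, n))

But I'm not sure this is the cleanest way, or whether it will pinpoint the source of the discrepancy.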
