
I have trained a model using sklearn and exported it to PMML using sklearn2pmml. Is there a way to convert that PMML file back into something that can be imported and run in Python?

I want to do this because I have noticed slight differences between the behaviour of the PMML model and the sklearn model. Specifically, the PMML file sets hard upper and lower bounds for each variable (the max and min of that variable in the training set), whereas sklearn does not. Problems arise when the PMML model encounters data outside these bounds. This is just one difference between the two models, and I want to re-import the PMML file into Python, run it, and see whether there are any others.

swang16
  • Classical question: Why do you need this PMML intermediate step, if you train your models using Python, and want to deploy them using Python? Why not Pickle? – user1808924 Mar 10 '17 at 16:44
  • I used python to build the model but the team I am handing off to deploy it uses java, hence the conversion to pmml – swang16 Mar 10 '17 at 18:23
  • Depending on what type of model, duplicate of http://stackoverflow.com/questions/41630562/chaid-pmml-parsing-in, http://stackoverflow.com/questions/41466964/python-tools-for-consuming-pmml-models, http://stackoverflow.com/questions/40048987/importing-pmml-models-into-python-scikit-learn, http://stackoverflow.com/questions/41383735/run-pmml-clustering-code-in-python and/or http://stackoverflow.com/questions/40532336/how-to-import-logistic-regression-and-kmeans-pmml-files-into-r ... – nekomatic Mar 13 '17 at 09:05

1 Answer


You don't need to test the correctness of sklearn2pmml generated models. It's based on the JPMML-SkLearn library, which has full coverage with integration tests - Scikit-Learn predictions and PMML predictions are provably identical.

Your real issue is that you want to apply models outside of their intended "applicability domain". It's a bad idea, because the model's behaviour is not specified in that case: garbage input, garbage predictions.

However, if you insist that you must be able to feed garbage to your models in a production environment, then simply disable PMML value bounds checking. There are several ways to accomplish this:

  1. Remove Value and Interval child elements from /PMML/DataDictionary/DataField elements.
  2. Modify Value and Interval child elements so that previously unseen values are recognized as valid. For example, you can define the margins of the Interval element to include all values [-Inf, +Inf]. See the explanation of Value and Interval elements in the PMML specification for the correct syntax.
  3. Change the invalidValueTreatment attribute value of all /PMML/<Model>/MiningSchema/MiningField elements from "returnInvalid" to "asIs". If this attribute is missing, it defaults to "returnInvalid", so you'd need to insert invalidValueTreatment="asIs" there.
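
As a rough illustration of option #1, here is a minimal Python sketch that uses only the standard library. The PMML fragment is hand-written for the example (it is not real sklearn2pmml output), and the helper name `strip_bounds` is made up; adapt both to your own file.

```python
import xml.etree.ElementTree as ET

# Hand-written PMML fragment, for illustration only.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="x1" optype="continuous" dataType="double">
      <Interval closure="closedClosed" leftMargin="0.5" rightMargin="9.5"/>
    </DataField>
    <DataField name="color" optype="categorical" dataType="string">
      <Value value="red"/>
      <Value value="blue"/>
    </DataField>
  </DataDictionary>
</PMML>"""


def strip_bounds(root):
    """Remove Interval and Value children of DataDictionary/DataField elements.

    Hypothetical helper; returns the number of removed elements.
    """
    # The PMML namespace depends on the schema version, so read it off the root tag.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    removed = 0
    for field in root.findall(f"{ns}DataDictionary/{ns}DataField"):
        # Iterate over a copy, because children are removed inside the loop.
        for child in list(field):
            if child.tag in (f"{ns}Interval", f"{ns}Value"):
                field.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = strip_bounds(root)
```

For a real file you would load it with `ET.parse(path).getroot()` instead of `ET.fromstring(SAMPLE)`.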

I would recommend option #3. You can automate the process using JPMML-Model library:

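// Assumed imports for the unqualified class names used below
import org.dmg.pmml.InvalidValueTreatmentMethod;
import org.dmg.pmml.MiningField;
import org.dmg.pmml.VisitorAction;

// loadFromFile and saveToFile are placeholders for your own I/O code
org.dmg.pmml.PMML pmml = loadFromFile(..);
org.dmg.pmml.Visitor mfUpdater = new org.jpmml.model.visitors.AbstractVisitor(){
  @Override
  public VisitorAction visit(MiningField miningField){
    // Pass out-of-range values on to the model instead of rejecting them
    miningField.setInvalidValueTreatment(InvalidValueTreatmentMethod.AS_IS);
    return VisitorAction.CONTINUE;
  }
};
// Walk the whole class model, updating every MiningField element
mfUpdater.applyTo(pmml);
saveToFile(pmml, ...);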
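
If you'd rather stay in Python, option #3 can also be approximated by editing the XML directly. This is only a sketch using the standard library: the PMML fragment is a hand-written stand-in for a real file, and `set_as_is` is a hypothetical helper name. Like the visitor above, it updates every MiningField element, including the target field.

```python
import xml.etree.ElementTree as ET

# Hand-written PMML fragment, for illustration only.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <TreeModel functionName="regression">
    <MiningSchema>
      <MiningField name="y" usageType="target"/>
      <MiningField name="x1"/>
      <MiningField name="x2" invalidValueTreatment="returnInvalid"/>
    </MiningSchema>
  </TreeModel>
</PMML>"""


def set_as_is(root):
    """Set invalidValueTreatment="asIs" on every MiningField element.

    Hypothetical helper; returns the number of updated elements.
    """
    updated = 0
    for elem in root.iter():
        # Compare local names so this works for any PMML namespace version.
        if elem.tag.rsplit("}", 1)[-1] == "MiningField":
            elem.set("invalidValueTreatment", "asIs")
            updated += 1
    return updated


root = ET.fromstring(SAMPLE)
updated_count = set_as_is(root)

# Register the PMML namespace as the default one so the output
# is written without generated "ns0:" prefixes.
ET.register_namespace("", "http://www.dmg.org/PMML-4_3")
output = ET.tostring(root, encoding="unicode")
```

For a real file, use `ET.parse(path)` and `tree.write(out_path)`, and pass the namespace URI that your file actually declares to `ET.register_namespace`.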
user1808924
  • Also, sklearn2pmml lets you specify `asIs` invalid value treatment during model generation. Simply replace `ContinuousDomain()` with `ContinuousDomain(invalid_value_treatment = "as_is")` in your Python script: https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L21 – user1808924 Mar 10 '17 at 21:46