I am using the scikit-learn (sklearn) Isolation Forest for anomaly detection. I also converted the model to PMML format with the sklearn2pmml library. Ideally, predictions from both files (the pickle, and the PMML scored with the JPMML Evaluator) should be identical, since the PMML was generated from the same model.
However, I found that the predictions (anomaly scores) for a couple of records differ at the third or fourth decimal place. For example, the prediction using the pickle is 0.975643, while the prediction using the PMML file for the same record is 0.975498.
I tried to replicate this problem with the Boston dataset that is built into sklearn. I created an isolation forest with a single tree and saved it both as a pickle file and as PMML. I then converted the pickled model to text using Python code (how to extract the decision rules from scikit-learn decision-tree?) and compared it with the PMML file.
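For reference, a minimal sketch of that reproduction. It assumes a recent scikit-learn, where the Boston dataset is no longer bundled, so random synthetic data and made-up feature names stand in for it. It also uses export_text in place of the hand-written rule extraction from the linked question, so the output format will differ from the one quoted in this post.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import export_text

# Synthetic stand-in for the Boston data (assumption: any numeric matrix will do).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
feature_names = ["f0", "f1", "f2"]  # hypothetical feature names

# Isolation forest with a single tree, as in the experiment described above.
model = IsolationForest(n_estimators=1, random_state=0).fit(X)
tree = model.estimators_[0]

# Print the split rules with many decimals so that any rounding of the
# thresholds would be visible when compared against the PMML file.
rules = export_text(tree, feature_names=feature_names, decimals=13)
print(rules)
```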
I found that the tree node thresholds in the Python model are written differently from those in the PMML. For example:
PMML: SimplePredicate field="CHAS" operator="lessOrEqual" value="0.08887574"
Python pickle: if CHAS <= 0.0888757431362:
So it looks like the pickle stores thresholds as numpy float64, while the PMML stores them as float32, I guess.
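The two values quoted above are at least consistent with being one number at two precisions. A small check, assuming NumPy is available (the variable names are only illustrative): rounding the Python value to float32 and formatting it to eight decimals gives the value seen in the PMML. This shows the values are compatible, not which library does the rounding.

```python
import numpy as np

# Threshold as printed from the Python tree (taken from the post above).
threshold_from_pickle = 0.0888757431362

# Round the same number to single precision.
as_float32 = np.float32(threshold_from_pickle)

print(repr(threshold_from_pickle))
print(format(float(as_float32), ".8f"))  # compare with the PMML value

# Rounding error introduced by the float32 conversion.
print(float(as_float32) - threshold_from_pickle)
```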
But looking at the sklearn code on GitHub, my guess is that sklearn also stores tree thresholds as numpy float32. If so, why does my pickle hold float64 values rather than float32? 0.0888757431362 has more significant digits than a float32 can represent.
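One way to test this assumption directly, rather than infer it from the source, is to ask a fitted tree for the dtype of its threshold array. The sketch below assumes the tree_.threshold and tree_.children_left attributes that scikit-learn exposes on fitted tree estimators, and again uses synthetic data instead of the Boston dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption: the dataset does not matter for this check).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))

model = IsolationForest(n_estimators=1, random_state=0).fit(X)
tree_ = model.estimators_[0].tree_

# dtype of the stored split thresholds
print(tree_.threshold.dtype)

# Keep only internal nodes; leaves are marked with children_left == -1.
internal = tree_.children_left != -1
thresholds = tree_.threshold[internal]

# Count thresholds whose value changes after a round trip through float32.
changed = thresholds != thresholds.astype(np.float32).astype(np.float64)
print(changed.sum(), "of", thresholds.size, "thresholds change when rounded to float32")
```

If the count on the last line is non-zero, a float32 copy of the thresholds is not identical to the stored ones, and a record whose feature value lies between the two versions of a threshold could take a different branch.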
Please let me know if anybody has an idea about this issue, or whether my assumption is wrong.