I've developed a spam classifier using pandas and scikit-learn to the point where it's ready for integration into our Hadoop-based system. To this end, I need to export my classifier to a more common format than a pickle.

The Predictive Model Markup Language (PMML) is my preferred export format. It plays exceedingly well with Cascading, which we already use. Surprisingly, however, I cannot find any Python libraries that export scikit-learn models to PMML.

Has anyone had experience with this use case? Is there an alternative to PMML that would provide interoperability between scikit-learn and Hadoop? What about a solid PMML export library?

Axel Magnuson
  • there's a similar question over at Quora http://www.quora.com/How-do-I-use-scikit-learn-with-Hadoop-and-Mapreduce – miraculixx Jun 13 '14 at 19:37
  • Thanks for the input. Using the streaming API is not ideal, but I may have to resort to it if all else fails. – Axel Magnuson Jun 13 '14 at 19:52
  • Spam classification as in email spam? How did you come to use a Random Forest for that? – Raff.Edward Jun 14 '14 at 03:28
  • Actually in this case, it's microblog spam where we are targeting only a subset of all machine-generated messages. The relative variety of features seems to play nice with random forest. – Axel Magnuson Jun 14 '14 at 20:21

1 Answer

You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML, from Zementis, seems to be a commercial product. Apart from this route, there are no other tools for scoring scikit-learn models exported as PMML on Java/Hadoop. The core scikit-learn team is planning to implement a PMML exporter, though. If you don't want a commercial solution and would rather not wait for such a tool to be implemented, you still have some options, but they require some coding:

  • Adapt the SKLearn Compiled trees project so it generates Java/MapReduce code instead of C.
  • Use the export_graphviz function to obtain the DOT representation of each decision tree, then write a small Java interpreter for it.
  • Forget about Java and Hadoop: use Apache Spark and evaluate each of the decision trees in parallel using Python, scikit-learn and PySpark.
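To give an idea of how small the "write your own interpreter" option can be, here is a rough sketch. Note that it is a variation on the second bullet, not the DOT route itself: instead of parsing DOT output, it reads the flat node arrays that scikit-learn keeps on each fitted tree (`tree_.children_left`, `tree_.children_right`, `tree_.feature`, `tree_.threshold`, `tree_.value`) and dumps them to JSON. The function names and the JSON layout are made up for illustration, and the sketch assumes a single-output classifier such as a fitted `RandomForestClassifier`. The evaluator is written in Python only to show the logic; the same loop is what you would port to Java.

```python
import json


def tree_to_dict(tree):
    """Copy the flat node arrays of a fitted tree (estimator.tree_) into plain lists."""
    return {
        "children_left": [int(i) for i in tree.children_left],
        "children_right": [int(i) for i in tree.children_right],
        "feature": [int(i) for i in tree.feature],
        "threshold": [float(t) for t in tree.threshold],
        # value[node][0] holds the class weights of a single-output classifier.
        "value": [[float(v) for v in node[0]] for node in tree.value],
    }


def forest_to_json(model):
    """Serialize every tree of a fitted forest (model.estimators_) to a JSON string."""
    trees = [tree_to_dict(est.tree_) for est in model.estimators_]
    return json.dumps({"trees": trees})


def predict_tree(tree, x):
    """Walk one exported tree; a node is a leaf when its left child is -1."""
    node = 0
    while tree["children_left"][node] != -1:
        # scikit-learn sends a sample left when feature value <= threshold.
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    weights = tree["value"][node]
    total = sum(weights)
    return [w / total for w in weights]


def predict_proba(forest, x):
    """Average the per-tree class distributions, as a random forest does."""
    per_tree = [predict_tree(t, x) for t in forest["trees"]]
    return [sum(col) / len(per_tree) for col in zip(*per_tree)]
```

You would call `forest_to_json(clf)` on the trained classifier, ship the resulting string to the cluster, and run the equivalent of `predict_proba(json.loads(s), features)` inside each mapper. Before relying on it, compare the output against `clf.predict_proba` on a sample of your data.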

Hope it helps!

Mauro D.
  • The export of SkLearn models to PMML can be handled by the JPMML-SkLearn (https://github.com/jpmml/jpmml-sklearn) library/command-line application now. It is much more robust and easier to work with than Py2PMML. – user1808924 Oct 15 '15 at 09:28