Exporting a Scikit Learn Random Forest for use on Hadoop Platform

Question

I've developed a spam classifier using pandas and scikit-learn, and it's ready for integration into our Hadoop-based system. To that end, I need to export my classifier to a more common format than a pickle.

The Predictive Model Markup Language (PMML) is my preferred export format. It plays exceedingly well with Cascading, which we already use. Surprisingly, however, I cannot find any Python libraries that export scikit-learn models to PMML.

Has anyone had experience with this use case? Is there an alternative to PMML that would provide interoperability between scikit-learn and Hadoop? Or is there a solid PMML export library?

there's a similar question over at Quora http://www.quora.com/How-do-I-use-scikit-learn-with-Hadoop-and-Mapreduce — miraculixx, Jun 13 '14 at 19:37
Thanks for the input. Using the streaming API is not ideal, but I may have to resort to it if all else fails. — Axel Magnuson, Jun 13 '14 at 19:52
Spam classification as in email spam? How did you come to use a Random Forest for that? — Raff.Edward, Jun 14 '14 at 03:28
Actually in this case, it's microblog spam where we are targeting only a subset of all machine-generated messages. The relative variety of features seems to play nice with random forest. — Axel Magnuson, Jun 14 '14 at 20:21

score 9 · Accepted Answer · answered Jun 13 '14 at 22:54

You could use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML from Zementis seems to be a commercial product. Besides this alternative, there are no other tools for scoring scikit-learn models exported as PMML on Java/Hadoop, although the core scikit-learn team is planning to implement a PMML exporter. If you don't want a commercial solution and don't want to wait for such a tool to be implemented, you still have some options, but they require some coding:

Adapt the SKLearn Compiled trees project so it generates Java/MapReduce code instead of C.
Use the export_graphviz function to obtain the DOT representation of each decision tree, and write a small Java interpreter for it.
Forget about Java and Hadoop: use Apache Spark and evaluate each of the decision trees in parallel using Python, scikit-learn and PySpark.
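
To give an idea of how small such an interpreter can be, here is a minimal sketch in Python (the same loop ports directly to Java). It assumes each tree has been dumped into parallel arrays shaped like the ones a fitted scikit-learn tree exposes on its tree_ attribute (children_left, children_right, feature, threshold, value). The three toy trees below are hand-written, hypothetical data, not the output of a real model, and the helper names are made up for illustration.

```python
# Hand-written toy trees (hypothetical data). Each tree is a set of parallel
# arrays indexed by node id, mirroring the layout of scikit-learn's `tree_`
# attribute. A child index of -1 marks a leaf; `value` holds class counts.
TREE_A = {
    "children_left":  [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature":        [0, -2, -2],
    "threshold":      [0.5, -2.0, -2.0],
    "value":          [[5, 5], [5, 0], [0, 5]],
}
TREE_B = {
    "children_left":  [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature":        [1, -2, -2],
    "threshold":      [2.0, -2.0, -2.0],
    "value":          [[5, 5], [4, 1], [1, 4]],
}
TREE_C = {
    "children_left":  [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature":        [0, -2, -2],
    "threshold":      [1.5, -2.0, -2.0],
    "value":          [[3, 5], [3, 1], [0, 4]],
}
FOREST = [TREE_A, TREE_B, TREE_C]


def leaf_distribution(tree, x):
    """Walk one tree from the root to a leaf; return its class distribution."""
    node = 0
    while tree["children_left"][node] != -1:
        # Samples with feature value <= threshold go to the left child.
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    counts = tree["value"][node]
    total = sum(counts)
    return [c / total for c in counts]


def predict_forest(trees, x):
    """Average the per-tree class distributions and return the best class index."""
    dists = [leaf_distribution(tree, x) for tree in trees]
    n_classes = len(dists[0])
    avg = [sum(d[k] for d in dists) / len(dists) for k in range(n_classes)]
    return avg.index(max(avg))


if __name__ == "__main__":
    print(predict_forest(FOREST, [0.2, 1.0]))
    print(predict_forest(FOREST, [1.0, 3.0]))
```

With a real model, the arrays would presumably be read from each estimator in the forest's estimators_ list and shipped to the cluster as plain data (JSON, for example), so nothing on the Hadoop side needs to unpickle Python objects.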

Hope it helps!

The export of SkLearn models to PMML can be handled by the JPMML-SkLearn (https://github.com/jpmml/jpmml-sklearn) library/command-line application now. It is much more robust and easier to work with than Py2PMML. — user1808924, Oct 15 '15 at 09:28
