I created a model using H2O's Sparkling Water, and now I'd like to apply it to a huge Spark DataFrame (populated with sparse vectors). I'm using Python with pyspark and pysparkling. Essentially I need to run a map job over the DataFrame with a model.predict() call inside, but copying the data into the H2O context is a huge overhead and not an option. What I think I'll do instead is extract the POJO (Java class) model from the H2O model and use it to do the map over the DataFrame (a rough sketch of what I mean follows the questions below). My questions are:
- Is there a better way?
- How do I write a pyspark wrapper for a Java class from which I intend to use only one method, .score(double[] data, double[] result)?
- How can I reuse as much as possible of the existing wrappers from the Spark ML library?
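
To make the intent concrete, here is roughly what I have in mind: calling the compiled POJO through the py4j gateway. This is only a sketch under my own assumptions, not working code: the class name `my_h2o_pojo_model` is a placeholder for whatever h2o exported, I'm assuming the scoring method is GenModel's `score0(double[] data, double[] preds)` and that the input array is already in the column order the model expects, and I know the py4j gateway only lives on the driver, which is exactly why I'm asking how to do this properly inside a distributed map over the DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
gateway = spark.sparkContext._gateway
jvm = spark.sparkContext._jvm

# Placeholder class name: the real one is whatever the exported POJO is called,
# and its compiled jar must already be on the Spark classpath.
pojo = jvm.my_h2o_pojo_model()

def score_row(features, n_outputs):
    """Score one row of doubles with the POJO (driver-side sketch only)."""
    # py4j does not convert Python lists to double[] automatically,
    # so build real Java arrays first.
    jdata = gateway.new_array(gateway.jvm.double, len(features))
    for i, v in enumerate(features):
        jdata[i] = float(v)
    jpreds = gateway.new_array(gateway.jvm.double, n_outputs)
    # Assuming the generated POJO exposes GenModel's score0(double[] data, double[] preds).
    pojo.score0(jdata, jpreds)
    return list(jpreds)
```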
Thank you!