I created a model using H2O's Sparkling Water, and now I'd like to apply it to a huge Spark DataFrame (populated with sparse vectors). I'm using Python with pyspark and pysparkling. Essentially I need to run a map job over the DataFrame with a model.predict() call inside, but copying the data into the H2O context is a huge overhead and not an option. What I think I'll do instead is extract the POJO (Java class) model from the H2O model and use it to do the map over the DataFrame (a rough sketch of what I mean follows the questions below). My questions are:
- Is there a better way?
- How do I write a pyspark wrapper for a Java class from which I intend to use only one method, .score(double[] data, double[] result)?
- How can I reuse as much as possible of the existing wrappers from the Spark ML library?
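
To make the intent concrete, here is roughly what I have in mind: calling the compiled POJO through the py4j gateway. This is only a sketch under my own assumptions, not working code: the class name `my_h2o_pojo_model` is a placeholder for whatever h2o exported, I'm assuming the scoring method is GenModel's `score0(double[] data, double[] preds)` and that the input array is already in the column order the model expects, and I know the py4j gateway only lives on the driver, which is exactly why I'm asking how to do this properly inside a distributed map over the DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
gateway = spark.sparkContext._gateway
jvm = spark.sparkContext._jvm

# Placeholder class name: the real one is whatever the exported POJO is called,
# and its compiled jar must already be on the Spark classpath.
pojo = jvm.my_h2o_pojo_model()

def score_row(features, n_outputs):
    """Score one row of doubles with the POJO (driver-side sketch only)."""
    # py4j does not convert Python lists to double[] automatically,
    # so build real Java arrays first.
    jdata = gateway.new_array(gateway.jvm.double, len(features))
    for i, v in enumerate(features):
        jdata[i] = float(v)
    jpreds = gateway.new_array(gateway.jvm.double, n_outputs)
    # Assuming the generated POJO exposes GenModel's score0(double[] data, double[] preds).
    pojo.score0(jdata, jpreds)
    return list(jpreds)
```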
Thank you!