
I'm trying to run a grid search for a Gradient Boosting Machine in PySpark with H2O Sparkling Water.

I have put together a reproducible example with the famous iris dataset.

from pysparkling import H2OContext, H2OConf
import pyspark
from pyspark.sql.types import StructType, StructField, FloatType, StringType
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("local").setAppName("test")
conf.set("spark.sql.shuffle.partitions", 3)
conf.set("spark.default.parallelism", 3)
conf.set("spark.debug.maxToStringFields", 100)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
hc = H2OContext.getOrCreate(sc, H2OConf(sc).set_internal_cluster_mode())
schema = StructType([
    StructField("sepal_length", FloatType(), True),
    StructField("sepal_width", FloatType(), True),
    StructField("petal_length", FloatType(), True),
    StructField("petal_width", FloatType(), True),
    StructField("class", StringType(), True)])
iris_df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'false') \
        .option('delimiter', ',') \
        .schema(schema) \
        .load('../../../../Downloads/iris.data')

If I follow this page of the H2O docs and simply translate the example to Python:

gbm_params = {'learnRate': [0.01, 0.1],
              'ntrees': [100 , 200, 300, 500]}
gbm_grid = H2OGridSearch()\
    .setLabelCol("class") \
    .setHyperParameters(gbm_params)\
    .setAlgo(H2OGBM().setMaxDepth(30))

model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

I get an internal NullPointerException. I guess something is missing in the configuration.

Py4JJavaError: An error occurred while calling o111.fit.
: java.lang.NullPointerException
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.extractH2OParameters(H2OGridSearch.scala:352)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

If I rewrite it in a different way, I get a different error:

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learnRate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

This is the output, no matter how I change the hyperparameters:

Py4JJavaError: An error occurred while calling o1817.fit.
: java.lang.NoSuchFieldException: learnRate
    at java.lang.Class.getField(Unknown Source)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

The following works, but it is not useful since there is no grid:

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         #hyperParameters=gbm_params,
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()

And finally, just to be sure that learnRate is a parameter of H2OGBM, this also works:

gbm_model = H2OGBM(labelCol='class',
                   withDetailedPredictionCol=True).setLearnRate(0.01).setMaxDepth(5).setNtrees(100)

model_pipeline = Pipeline().setStages([gbm_model])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
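
Since the individual setters work, one stopgap would be to expand the grid by hand and fit one H2OGBM per combination. The following is only a minimal sketch of the expansion step in plain Python; `expand_grid` is a hypothetical helper of mine, not part of Sparkling Water, and the keys are just labels that I would map to the setters myself.

```python
from itertools import product


def expand_grid(hyper_params):
    """Return one dict per combination of the listed hyperparameter values."""
    names = sorted(hyper_params)  # fixed order so the result is reproducible
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


gbm_params = {"learnRate": [0.01, 0.1], "ntrees": [100, 200, 300, 500]}
combos = expand_grid(gbm_params)

for combo in combos:
    # Each combination could then be applied with the setters shown above,
    # e.g. H2OGBM(labelCol='class').setLearnRate(combo["learnRate"])
    #                              .setNtrees(combo["ntrees"])
    print(combo)
```

This loses whatever model selection H2OGridSearch does internally, so each fitted model would still have to be compared by hand.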

EDIT: missing imports

from pyspark.ml.pipeline import Pipeline
from ai.h2o.sparkling.ml.algos import H2OGridSearch
from ai.h2o.sparkling.ml.algos import H2OGBM

and Sparkling Water version

h2o-pysparkling-2-4       3.28.0.1-1               pypi_0    pypi

EDIT after comments for Spark/H2O/Java versions

Spark: 2.4.4

H2O: 3.28.0.3

Java: 1.8.0_232


EDIT java -version

openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

I get the same error if I use learn_rate instead of learnRate.

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learn_rate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

...

Py4JJavaError: An error occurred while calling o1376.fit.
: java.lang.NoSuchFieldException: learn_rate
    at java.lang.Class.getField(Class.java:1703)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
lrnzcig
  • @Irnzcig are you running this on a single dev laptop or are you using independent H2o shiny server and spark cluster? – Kristian Feb 08 '20 at 14:04
  • @Kristian I'm running on both. Obviously the test is run in a local machine, but I've also tried in a spark cluster. – lrnzcig Feb 08 '20 at 14:14
  • @Irnzcig , thanks, I wanted to be as close to the same set up as you. – Kristian Feb 08 '20 at 14:38
  • What version of H2O, Spark and Java are you using? Do H2O and Spark use same or different versions of Java on their respective servers? – Kristian Feb 08 '20 at 14:40
  • Spark 2.4 H2O 3.28 Java in the cluster I need to check, later today I'll connect – lrnzcig Feb 08 '20 at 15:01
  • The parameter is called [learn_rate](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/learn_rate.html) and not `learnRate`. – cronoik Feb 09 '20 at 10:43
  • Thanks @cronok. I've tried "learnRate", "learn_rate", "LearnRate", "lEaRnRaTe".... Thanks. – lrnzcig Feb 09 '20 at 11:59
  • @Kristian I edited the question with the versions... however, how do I check that H2O is using the same Java version as Spark? Thanks a lot! – lrnzcig Feb 09 '20 at 12:02
  • When you init h2o it should tell you what server you are connected to, what JAVA_HONE you are using, etc – Kristian Feb 09 '20 at 12:05
  • Thanks @Kristian. I'm probably making a mistake here since I don't tell anything about Java when I init h2o... I'll check this configuration then and let you know. Again thanks for the hint. – lrnzcig Feb 09 '20 at 12:10
  • @lrnzcig , sorry for the typo, it should have read JAVA_HOME – Kristian Feb 13 '20 at 12:17
  • @lrnzcig , in a terminal window type: java -version – Kristian Feb 13 '20 at 12:21
  • Sorry @Kristian. I've edited my question with `java -version`. Thanks. – lrnzcig Feb 13 '20 at 13:32
  • @lrnzcig: Can you please update your question with `learn_rate` as parameter? The second error is explicitly complaining about the wrong parameter learnRate. – cronoik Feb 13 '20 at 19:06
  • Done. Thank you @cronoik. – lrnzcig Feb 13 '20 at 19:33
  • The parameter is called `hyper_params`. Can you please try the following: ` and ... `hyper_params= {'learn_rate':[0.01, 0.1], 'ntrees': [100, 200]}` ... – cronoik Feb 13 '20 at 22:26
  • Hi @cronoik, If I use `hyper_params` I get `AttributeError: 'H2OGridSearch' object has no attribute 'hyper_params'`. Thanks. – lrnzcig Feb 14 '20 at 07:11
  • And have you followed all steps in the 'Prepare the environment' section of the docs?:) – mirekphd Feb 15 '20 at 13:29

2 Answers


Why not use a workaround and create the grid in the H2O UI? There's a checkbox that makes your chosen parameter griddable, and you can then supply the parameter values as a comma-separated list in the web form field where you would normally enter a single value.

mirekphd

There's a workaround here that I had not noticed (I should probably have reported this as a bug on GitHub in the first place): prefix the hyperparameter names with an underscore.

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'_learn_rate':[0.01, 0.1], '_ntrees': [100, 200]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
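
My guess at why this works, based only on the stack traces in the question: `findField` calls `java.lang.Class.getField` with the key exactly as given, so the key apparently has to match a field name on the underlying H2O parameters object, and in this version those names seem to start with an underscore. Under that assumption, a small helper can add the prefix so the grid can still be written with the documented names. `to_h2o_field_names` is a hypothetical name of mine, not part of Sparkling Water.

```python
def to_h2o_field_names(hyper_params):
    """Prefix each hyperparameter name with '_' unless it already has one."""
    return {
        (name if name.startswith("_") else "_" + name): list(values)
        for name, values in hyper_params.items()
    }


gbm_params = {"learn_rate": [0.01, 0.1], "ntrees": [100, 200]}
prefixed_params = to_h2o_field_names(gbm_params)

# prefixed_params can then be passed as hyperParameters=... to H2OGridSearch,
# exactly as in the workaround above.
print(prefixed_params)
```

If a later Sparkling Water release accepts the plain names, the helper should simply be dropped.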
lrnzcig