Tools used:
- Spark 2
- Sparkling Water (H2O)
- Zeppelin notebook
- PySpark code
I'm starting H2O in INTERNAL mode from my Zeppelin notebook, since my environment runs on YARN. I'm using the basic commands:
from pysparkling import *
import h2o
hc = H2OContext.getOrCreate(spark)
My problem is that the Zeppelin server is installed on a weak machine, and when I run my code from Zeppelin, the H2O client automatically starts on that machine, using its IP. The driver runs there, so I'm limited by the driver memory that H2O consumes. I have 4 strong worker nodes, each with 100 GB of RAM and many cores, and the cluster does use them while my models run. I would like the H2O client to start on one of these worker machines, with the driver running there as well, but I haven't found a way to force H2O to do that.
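One direction I've considered (a sketch, not verified in my environment): in Spark on YARN, the driver's placement is controlled by the deploy mode, and Zeppelin's Spark interpreter uses client mode by default, which pins the driver (and hence the in-process H2O client) to the Zeppelin host. If the interpreter supports it, switching it to cluster mode should make YARN place the driver on one of the worker nodes, something like these interpreter properties (names as documented for the Spark interpreter; the memory value is just an illustration):

    master                   yarn
    spark.submit.deployMode  cluster
    spark.driver.memory      16g

But I'm not sure whether this is the right approach for Sparkling Water in INTERNAL mode, or whether there is an H2O-specific setting instead.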
I wonder if there is a solution, or whether I must install the Zeppelin server on a worker machine.
Any help would be appreciated.