Tools used:
- Spark 2
- Sparkling Water (H2O)
- Zeppelin notebook
- PySpark code
I'm starting H2O in INTERNAL mode from my Zeppelin notebook, since my environment runs on YARN. I'm using the basic commands:
from pysparkling import *
import h2o
hc = H2OContext.getOrCreate(spark)
My problem is that the Zeppelin server is installed on a weak machine, and when I run my code from Zeppelin, the H2O client automatically starts on that machine, using its IP. The driver runs there, so I'm limited by the driver memory that H2O consumes. I have 4 strong worker nodes, each with 100 GB of RAM and many cores, and the cluster does use them while my models run. I would like the H2O client to start on one of these worker machines, with the driver running there as well, but I haven't found a way to force H2O to do that.
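One direction I've considered (a sketch, not verified in my environment): in Spark on YARN, the driver's placement is controlled by the deploy mode, and Zeppelin's Spark interpreter uses client mode by default, which pins the driver (and hence the in-process H2O client) to the Zeppelin host. If the interpreter supports it, switching it to cluster mode should make YARN place the driver on one of the worker nodes, something like these interpreter properties (names as documented for the Spark interpreter; the memory value is just an illustration):

    master                   yarn
    spark.submit.deployMode  cluster
    spark.driver.memory      16g

But I'm not sure whether this is the right approach for Sparkling Water in INTERNAL mode, or whether there is an H2O-specific setting instead.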
I wonder if there is a solution, or whether I must install the Zeppelin server on a worker machine.
Any help would be appreciated.