
I am having a problem reproducing the AutoML tutorial in the H2O documentation. After initiating my local H2O server (h2o.init()) I get the following output, which looks correct:

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_181"; Java(TM) SE Runtime Environment (build 1.8.0_181-b13); Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
  Starting server from /home/cdsw/.local/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp3nh32di4
  JVM stdout: /tmp/tmp3nh32di4/h2o_cdsw_started_from_python.out
  JVM stderr: /tmp/tmp3nh32di4/h2o_cdsw_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 01 secs
H2O_cluster_timezone:   Etc/UTC
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:    3.32.1.3
H2O_cluster_version_age:    14 days, 20 hours and 29 minutes
H2O_cluster_name:   H2O_from_python_cdsw_cpcrap
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    13.98 Gb
H2O_cluster_total_cores:    32
H2O_cluster_allowed_cores:  32
H2O_cluster_status: accepting new members, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy:   {"http": null, "https": null}
H2O_internal_security:  False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.8.5 final

Next, I import the datasets as specified by the tutorial:

# Import the tutorial's sample train and test sets into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

Finally, I train my AutoML model:

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

That is when it crashes, with the following message:

AutoML progress: |██Failed polling AutoML progress log: Local server has died unexpectedly. RIP.
Job request failed Local server has died unexpectedly. RIP., will retry after 3s.
Job request failed Local server has died unexpectedly. RIP., will retry after 3s.

I have tried different datasets, including a small sample in case it was a memory issue, but to no avail; the error persists.

Does anyone know what I should do to fix this?

Much appreciated!

Regards.

  • Hi there, is the server dying outside of AutoML too? You can try a simple GBM example from here to check: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html#examples – Erin LeDell Jun 03 '21 at 19:00
  • Thanks a lot Erin for your input. I have indeed tried your suggested example and it does run fine. Looks like it is something related to the AutoML process? – Lucas Bali Jun 03 '21 at 20:36
  • An update, I have tried running it again, and it started failing like the AutoML, with the same message. Kind of erratic, not sure what is going on regarding the appearance of this error. – Lucas Bali Jun 03 '21 at 20:42

1 Answer

I think I was able to solve it. After some monitoring with the htop command, I believe the issue was indeed a memory one. I restarted H2O, limiting it to 1 GB of memory and 2 threads (the thread limit may not be strictly necessary), and everything now appears to run fine:

h2o.init(max_mem_size="1G", nthreads=2)

I hope this helps anyone who stumbles upon the same problem.

  • Something to note is that XGBoost (included in AutoML) uses memory outside of the H2O cluster. So when you use all the available memory on your machine with the H2O cluster, there's nothing left for XGBoost to use, and that might be what caused this issue. We have a ticket open to warn the user, which should lead to less confusion in the future: https://h2oai.atlassian.net/browse/PUBDEV-8095 – Erin LeDell Jun 04 '21 at 19:40
  • Also, you can probably leave nthreads = -1 (use all threads), and increase your memory size for the H2O cluster to about 2/3 of the total free RAM on your machine. I think your current config is a bit conservative, but that was a good way to try to identify the issue and fix it! – Erin LeDell Jun 04 '21 at 19:45
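Following up on the "2/3 of free RAM" sizing advice in the comment above, here is a rough standard-library sketch of how you might estimate that budget before calling h2o.init. It is Linux-only (it relies on os.sysconf keys), and the suggested_max_mem_gb helper is a name of my own invention, not part of the h2o API:

```python
import os

# Rough, Linux-only sketch of the sizing advice above: take about 2/3 of
# the currently available RAM as the H2O cluster's max_mem_size.
def suggested_max_mem_gb():
    page_size = os.sysconf("SC_PAGE_SIZE")        # bytes per memory page
    avail_pages = os.sysconf("SC_AVPHYS_PAGES")   # pages currently available
    avail_gb = page_size * avail_pages / 1024 ** 3
    return max(1, int(avail_gb * 2 / 3))          # never suggest below 1 GB

# Print the h2o.init call you would run with that budget:
print(f'h2o.init(max_mem_size="{suggested_max_mem_gb()}G", nthreads=-1)')
```

Leaving the remaining third of RAM unclaimed keeps headroom for XGBoost, which (per the first comment) allocates outside the H2O cluster's heap.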