Based on the question and subsequent answer here: when starting an H2O
instance on a Hadoop cluster (with, say, hadoop jar h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir
), the callback IP address used to connect to the H2O instance is selected by the Hadoop runtime. In most cases the IP address and port are chosen by the Hadoop runtime to find the best available, and the output looks like:
....
H2O node 172.18.4.63:54321 reports H2O cluster size 4
H2O node 172.18.4.67:54321 reports H2O cluster size 4
H2O cluster (4 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
Open H2O Flow in your web browser: http://172.18.4.67:54321
The connection URL appears in the output line: Open H2O Flow in your web browser: http://172.18.4.67:54321
The recommended way of using H2O
is to start and stop individual instances each time you want to use it (sorry, can't currently find the supporting documentation). The problem here is that if you want your Python code to start up and connect to an H2O
instance automatically, it is not going to know what IP to connect to until the H2O
instance is already up and running. Thus, a common way to start an H2O cluster on Hadoop is to let Hadoop decide the address, then parse the output for the line
Open H2O Flow in your web browser: x.x.x.x:54321
to extract the IP address.
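For illustration, extracting the address from that line could be done with a simple regular expression (the line format is taken from the log output quoted above):

```python
import re

# Example log line as printed by h2odriver (copied from the output above)
line = "Open H2O Flow in your web browser: http://172.18.4.67:54321"

# Capture the host and port that follow "Open H2O Flow in your web browser:"
match = re.search(r"web browser:\s*https?://([\d.]+):(\d+)", line)
if match:
    ip, port = match.group(1), match.group(2)
    print(ip, port)  # 172.18.4.67 54321
```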
The problem here is that h2odriver
is a blocking process whose output prints as a stream of text lines as the instance starts up, rather than in bulk, which makes it hard to get the needed output line using basic Python Popen logic to capture output. Is there a way to capture the output as it is being generated, in order to get the line with the connection IP?