I use targets as a pipelining tool for an ML project with H2O.
What makes H2O special here is that it creates its own "cluster" (essentially a separate local process/server that R communicates with via REST APIs, as far as I understand).
The issue I am having is two-fold:
- How can I stop/operate the cluster within the targets framework in a smart way?
- How can I save & load the data/models within the targets framework?
MWE
A minimal working example I came up with looks like this (this being the _targets.R file):
library(targets)
library(h2o)

# start the h2o cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to an h2o dataframe
  as.h2o(iris)
}

train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)
This kind of works, but runs into the two issues I outlined above and expand on below.
Ad 1 - stopping the cluster
Usually I would put an h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but that does not work in this case.
Alternatively, I came up with a new target that always runs:
# in _targets.R in the final list
tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))
This works when I run tar_make(), but not when I use tar_visnetwork().
Another option is to use:
# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)
Another alternative I came up with is to handle the server outside of targets and only connect to it from the pipeline, but I feel that this might break the targets workflow...
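To make that last alternative concrete, here is a rough, untested sketch (the helper name connect_h2o is my own): the cluster is started once in a separate R session, and _targets.R only ever attaches to it.

# started once, outside of targets, e.g. in a plain R session or a small start script:
#   h2o::h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

# in _targets.R, instead of the top-level h2o.init() that spawns the cluster:
connect_h2o <- function() {
  # startH2O = FALSE makes h2o.init() error instead of starting a new cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  invisible(h2o.clusterIsUp())
}

Shutting down would then happen wherever the cluster was started, which keeps it out of the pipeline entirely.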
Do you have any other ideas on how to handle this?
Ad 2 - saving the dataset and model
The code in the MWE does not save the data for the targets model and predict in the correct format (format = "qs"). Sometimes (I think when the cluster gets restarted or so), the stored data gets "invalidated" and h2o throws an error: what the R session holds is only a pointer to the H2O dataframe living on the cluster (see also the docs).
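For downstream results (though not for the model itself), one workaround I can think of is to pull them back into plain R objects, which serialize fine with format = "qs". A minimal sketch, reusing the MWE's predict_model():

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # as.data.frame() pulls the predictions from the cluster into the R session,
  # so the target stores actual data rather than a pointer to an H2O frame
  as.data.frame(h2o.predict(model, newdata = hex_data))
}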
For keras, which similarly stores its models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. Similarly, H2O would require h2o::h2o.exportFile() and h2o::h2o.importFile() for the dataset, and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also the docs).
Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file path? The downside of the latter is that the file would live outside the _targets folder system, if I am not mistaken.
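For completeness, the file-based fallback I have in mind would look roughly like this: an untested sketch, with the helper names (save_model_h2o, predict_from_file) and the h2o_store directory being my own choices.

save_model_h2o <- function(hex_data, dir = "h2o_store") {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  dir.create(dir, showWarnings = FALSE)
  model <- train_model(hex_data)
  # h2o.saveModel() writes the model under `dir` and returns the full file path,
  # which is what a format = "file" target is expected to return
  h2o.saveModel(model, path = dir, force = TRUE)
}

predict_from_file <- function(model_path, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  model <- h2o.loadModel(model_path)  # re-attach the saved model to the cluster
  as.data.frame(h2o.predict(model, newdata = hex_data))
}

# in the final list of _targets.R
list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model_file, save_model_h2o(data), format = "file"),
  tar_target(predict, predict_from_file(model_file, data), format = "qs")
)

The model file then ends up in h2o_store/ rather than under _targets/, which is exactly the drawback mentioned above.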