I am seeing what appears to be strange behavior when trying to set the base (reference) level of a categorical variable in a simple GLM using H2O. To illustrate, I have added a few lines to the documentation example for the h2o.relevel function in R.

 ## Not run: 
 library(h2o)
 h2o.init()

 # Convert iris dataset to an H2OFrame
 iris_hf <- as.h2o(iris)
 # Look at current ordering of the Species column levels
 h2o.levels(iris_hf["Species"])
 # "setosa"     "versicolor" "virginica" 

 # Fit a GLM with Species as the predictor and Sepal.Length as the response
 h2o.glm(x = "Species", y = "Sepal.Length", training_frame = iris_hf)

[image: h2o.glm output table]

You can already see a problem here: 'setosa' is supposed to be the reference level, but the GLM is using 'versicolor' instead.

In base R, I would use 'relevel' to change the base level, and this works as expected with base R's glm function. H2O has an equivalent function, h2o.relevel, but as noted above it does not seem to influence the GLM estimates in any way.

 # Change the reference level to "virginica"
 iris_hf["Species"] <- h2o.relevel(x = iris_hf["Species"], y = "virginica")
 # Observe new ordering
 h2o.levels(iris_hf["Species"])
 # "virginica"  "setosa"     "versicolor"

 # Refit the same GLM after releveling
 h2o.glm(x = "Species", y = "Sepal.Length", training_frame = iris_hf)

[image: h2o.glm output table after releveling]

As can be seen, the ordering of the variable names changes in the output table, but there is no change in the actual parameter estimates.
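For comparison, here is a minimal sketch of the base R behavior I am expecting. It uses only base R on the built-in iris data; the names iris_rl and fit are just illustrative.

```r
# Copy the built-in data so the original factor ordering is left untouched
iris_rl <- iris

# Make "virginica" the first (reference) level of Species
iris_rl$Species <- relevel(iris_rl$Species, ref = "virginica")
levels(iris_rl$Species)

# Gaussian GLM with Species as the only predictor
fit <- glm(Sepal.Length ~ Species, data = iris_rl)

# The intercept is the mean Sepal.Length of the reference level (virginica);
# the other two coefficients are differences from that reference
coef(fit)
```

Here the coefficients are reported for setosa and versicolor relative to virginica, which is what I would like to reproduce in H2O.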

The H2O documentation that I have read implies that h2o.relevel should do what I am expecting, and that h2o.glm should, by default, use the first level of a factor as the reference level when estimating coefficients. This does not seem to be the case, though.

darkness

1 Answer

Answering my own question: it appears that setting lambda = 0 in h2o.glm, which turns off regularization, is required to make this work as expected.
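A minimal sketch of that change, reusing the setup from the question. It assumes a local H2O cluster can be started with h2o.init(); the name fit is only illustrative, and the exact output may differ between H2O versions.

```r
library(h2o)
h2o.init()

# Convert iris to an H2OFrame and make "virginica" the first level
iris_hf <- as.h2o(iris)
iris_hf["Species"] <- h2o.relevel(x = iris_hf["Species"], y = "virginica")

# lambda = 0 disables the regularization penalty
fit <- h2o.glm(x = "Species", y = "Sepal.Length",
               training_frame = iris_hf, lambda = 0)

# Inspect the estimated coefficients
h2o.coef(fit)
```

With lambda = 0 the estimates should be comparable to those from an unpenalized base R fit on the releveled factor.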

darkness