
I am running an h2o grid search in R. The model is a GLM with a gamma distribution. I have defined the grid using the following settings:

hyper_parameters = list(alpha = c(0, .5), missing_values_handling = c("Skip", "MeanImputation"))

h2o.grid(algorithm = "glm",                          # Algorithm type
         grid_id = "grid.s",                         # Id so retrieving information on iterations will be easier later
         x = predictors,                             # Predictive features
         y = response,                               # Target variable
         training_frame = data,                      # Training set
         validation_frame = validate,                # Validation frame
         hyper_params = hyper_parameters,            # Alpha values (and missing-value handling) to iterate over
         remove_collinear_columns = TRUE,            # Remove collinear columns
         lambda_search = TRUE,                       # Search for the optimal lambda value
         seed = 1234,                                # Ensure reproducible results
         keep_cross_validation_predictions = FALSE,  # Do not keep cross-validation predictions
         compute_p_values = FALSE,                   # Do not compute p-values of the coefficients
         family = 'gamma',                           # Distribution type used
         standardize = TRUE,                         # Standardize continuous variables
         nfolds = 2,                                 # Number of cross-validation folds
         fold_assignment = "Modulo",                 # Fold assignment type to use for cross-validation
         link = "log")

When I run the above script, I get the following error:

Error in hyper_names[[index2]] : subscript out of bounds

Can you please help me find where the error is?

  • Please provide a [minimal reproducible script](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with an example dataset from R or h2o – phiver Jul 13 '18 at 15:11

1 Answer

As discussed in the comments, it is difficult to tell what causes the error without sample data and code. An out-of-bounds error typically means the code is trying to access a value that does not exist in the input, so the cause could be any of the inputs to h2o.grid(). I would check the columns and rows of the training and validation data sets. The hyperparameters from the question run fine with family = "binomial".
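The checks suggested above can be sketched as a small helper. This is only an illustration: `check_grid_inputs` is a hypothetical function (not part of h2o), and it relies only on `names()` and `nrow()`, which work on a plain data.frame and, as far as I know, on an H2OFrame as well. It will not necessarily pinpoint the cause of the error in the question.

```r
# Hypothetical helper (not part of h2o): sanity-check the inputs to h2o.grid().
# Returns a character vector of problems; an empty vector means none were found.
check_grid_inputs <- function(frame, x, y, hyper_params) {
  cols <- names(frame)
  problems <- character(0)

  # Every predictor and the response must exist as a column of the frame
  missing_cols <- setdiff(c(x, y), cols)
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("columns not in frame:", paste(missing_cols, collapse = ", ")))
  }
  # The response should not also be used as a predictor
  if (y %in% x) {
    problems <- c(problems, "response is also listed as a predictor")
  }
  # The frame must contain data
  if (nrow(frame) == 0) {
    problems <- c(problems, "frame has no rows")
  }
  # Hyperparameters must be a fully named list with at least one value each
  hp_names <- names(hyper_params)
  if (is.null(hp_names) || any(hp_names == "")) {
    problems <- c(problems, "every hyperparameter entry needs a name")
  }
  if (any(lengths(hyper_params) == 0)) {
    problems <- c(problems, "a hyperparameter has no candidate values")
  }
  problems
}

# Usage with a plain data.frame standing in for the training frame
toy <- data.frame(x1 = c(1, 2), x2 = c(3, 4), y = c(0.5, 1.5))
hp  <- list(alpha = c(0, 0.5), missing_values_handling = c("Skip", "MeanImputation"))
check_grid_inputs(toy, x = c("x1", "x2"), y = "y", hyper_params = hp)
check_grid_inputs(toy, x = c("x1", "x9"), y = "y", hyper_params = hp)
```

Running the same helper on both the training and the validation frame would show whether a predictor is missing from one of them.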

The code below runs fine with h2o.grid() and algorithm = "glm". I have made several assumptions: (1) family = "binomial" is used instead of family = "gamma", based on the sample data created, (2) the response y is binary, (3) the train/validation/test split ratio, (4) the predictors are limited to three independent variables (x1, x2, x3), (5) there is one binary response variable (y).

Import libraries

library(h2o)

Create sample data

# Three positive numeric predictors
x1 <- abs(100 * rnorm(100))
x2 <- 10 + abs(100 * rnorm(100))
x3 <- 100 + abs(100 * rnorm(100))
# Binary response: floor(runif(100, 0, 1.5)) yields 0 or 1
y <- floor(runif(100, 0, 1.5))
df <- data.frame(x1, x2, x3, y)
df$y <- ifelse(df$y == 1, 'yes', 'no')
df$y <- as.factor(df$y)
head(df)

Initialize h2o

h2o.init()

Prepare data in required h2o format

df <- as.h2o(df)
y <- "y"
x <- setdiff( names(df), y )
df<- df[ df$y %in% c("no", "yes"), ]
h2o.setLevels(df$y, c("no","yes") )

# Split data into train and validate sets
data <- h2o.splitFrame( df, ratios = c(.6, 0.15) )
names(data) <- c('train', 'valid', 'test')
data$train

Set parameters

grid_id <- 'glm_grid'
hyper_parameters <- list( alpha = c(0, .5, 1),
                          lambda = c(1, 0.5, 0.1, 0.01),
                          missing_values_handling = c("Skip", "MeanImputation"),
                          # tweedie_variance_power only takes effect with family = "tweedie"
                          tweedie_variance_power = c(0, 1, 1.1, 1.8, 1.9, 2, 2.1, 2.5, 2.6, 3, 5, 7),
                          seed = 1234
)

Fit h2o.grid()

h2o.grid(
  algorithm = "glm",
  #grid_id = grid_id,
  hyper_params = hyper_parameters,
  training_frame = data$train,
  validation_frame = data$valid,
  x = x,
  y = y,
  lambda_search = TRUE,
  remove_collinear_columns = TRUE,
  keep_cross_validation_predictions = FALSE,
  compute_p_values = FALSE,
  standardize = TRUE,
  nfolds = 2,
  fold_assignment = "Modulo",
  family = "binomial"
)
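Inspect the grid (optional)

To see which hyperparameter combinations were actually built, the grid can be sorted and its best model retrieved. This is a rough sketch, not part of the original answer: it assumes a running h2o cluster and that the h2o.grid() call above was assigned to a variable, here the hypothetical name `grid`.

```r
library(h2o)

# Assumption: the result of the call above was stored, e.g. grid <- h2o.grid(...).
# `grid` is a hypothetical variable name; the original answer does not assign it.

# Sort the grid's models by AUC, best first (AUC fits the binomial example)
sorted_grid <- h2o.getGrid(grid_id = grid@grid_id, sort_by = "auc", decreasing = TRUE)
print(sorted_grid)

# Fetch the top-ranked model and report its AUC on the validation frame
best_model <- h2o.getModel(sorted_grid@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)
```

For a gamma model, as in the question, a regression metric would be needed in place of "auc" for sort_by.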


Nilesh Ingle