The following script, reproduces an equivalent problem as it was stated in h2o Help (Help -> View Example Flow
or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow
, download), but using h2o R-package and a fixed seed (123456
):
library(h2o)
# To use avaliable cores
h2o.init(max_mem_size = "12g", nthreads = -1)
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
ignore_const_cols = T, y = response,
family="binomial", solver="IRLSM",
alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
non_negative=F, score_each_iteration=F,
max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the Confusion Matrix for the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
And the metrics:
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
On contrary the result of h2o flow has a better performance:
and Confusion Matrix for max f1 threshold:
The h2o flow performance is much better than running the same algorithm using the equivalent R-package function.
Note: For sake of simplicity I am using Airlines Delay problem, that is a well-known problem using h2o, but I realized that such kind of significant difference are found in other similar situations using glm
algorithm.
Any thought about why these significant differences occur
Appendix A: Using default model parameters
Following the suggestion from @DarrenCook answer, just using default building parameters except for excluding columns and seed:
h2o flow
Now the buildModel
is invoked like this:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
and the results are:
and the training metrics:
Running R-Script
The following script allows for an easy switch into default configuration (via IS_DEFAULT_MODEL
variable) and also keeping the configuration as it states in the Airlines Delay example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use avaliable cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted)
for F1-optimal threshold:
NO YES Error Rate
NO 0 20887 1.000000 =20887/20887
YES 0 23091 0.000000 =0/23091
Totals 0 43978 0.474942 =20887/43978
Some metrics are close, but the Confusion Matrix is quite diferent, the R-Script predict all flights as delayed.
Appendix B: Configuration
Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08
Note: I tested the R-Script also under 3.19.0.4231 with the same results
This is the cluster information after running the R:
> h2o.init(max_mem_size = "12g", nthreads = -1)
R is connected to the H2O cluster:
H2O cluster version: 3.18.0.4
...
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.3 (2017-03-06)