
The following script reproduces the problem as stated in the h2o Help example (Help -> View Example Flow, or Help -> Browse Installed packs... -> examples -> Airlines Delay.flow, download), but using the h2o R package and a fixed seed (123456):

library(h2o)
# To use all available cores
h2o.init(max_mem_size = "12g", nthreads = -1)

IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
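# Optional sanity check (h2o.levels is a standard h2o accessor): the
# response should now be a two-level factor with levels "NO"/"YES"
print(h2o.levels(allyears2k.hex[, response]))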

# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
    x = predictors,
    model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
    ignore_const_cols = T, y = response,
    family="binomial", solver="IRLSM",
    alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
    non_negative=F, score_each_iteration=F,
    max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
    beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

This is the Confusion Matrix for the training set:

 Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       NO   YES    Error          Rate
NO      0 20887 1.000000  =20887/20887
YES     0 23091 0.000000      =0/23091
Totals  0 43978 0.474942  =20887/43978

And the metrics:

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.2473858
RMSE:  0.4973789
LogLoss:  0.6878898
Mean Per-Class Error:  0.5
AUC:  0.5550138
Gini:  0.1100276
R^2:  0.007965165
Residual Deviance:  60504.04
AIC:  60516.04

By contrast, the h2o Flow result shows better performance:

[Flow screenshot: training metrics for the max-F1 threshold]

[Flow screenshot: confusion matrix for the max-F1 threshold]

The h2o Flow performance is much better than that of the equivalent R-package call running the same algorithm.

Note: For the sake of simplicity I am using the Airlines Delay problem, a well-known h2o example, but I have seen similarly large differences in other situations using the glm algorithm.

Any thoughts about why these significant differences occur?

Appendix A: Using default model parameters

Following the suggestion in @DarrenCook's answer, I used default build parameters except for the excluded columns and the seed:

h2o flow

Now the buildModel is invoked like this:

buildModel 'glm', {"model_id":"glm_model-default",
  "seed":"123456","training_frame":"allyears2k.hex",
  "ignored_columns": 
     ["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
      "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
      "LateAircraftDelay","IsArrDelayed"],
   "response_column":"IsDepDelayed","family":"binomial"

}

and the results are:

[Flow screenshot: ROC curve and parameters for the max-F1 criterion]

and the training metrics:

[Flow screenshot: training metrics]

Running R-Script

The following script allows for an easy switch into default configuration (via IS_DEFAULT_MODEL variable) and also keeping the configuration as it states in the Airlines Delay example:

library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use all available cores

IS_LOCAL_FILE    = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])

if (IS_DEFAULT_MODEL) {
    fit1 <- h2o.glm(
        x = predictors, model_id = "glm_model", seed = 123456,
        training_frame = allyears2k.hex, y = response, family = "binomial"
    )
} else { # Copied and pasted from the flow, then converting to R syntax
    fit1 <- h2o.glm(
        x = predictors,
        model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
        ignore_const_cols = T, y = response,
        family = "binomial", solver = "IRLSM",
        alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
        non_negative = F, score_each_iteration = F,
        max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
        beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
    )
}

# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

It produces the following results:

MSE:  0.2473859
RMSE:  0.497379
LogLoss:  0.6878898
Mean Per-Class Error:  0.5
AUC:  0.5549898
Gini:  0.1099796
R^2:  0.007964984
Residual Deviance:  60504.04
AIC:  60516.04

Confusion Matrix (vertical: actual; across: predicted) 
for F1-optimal threshold:
       NO   YES    Error          Rate
NO      0 20887 1.000000  =20887/20887
YES     0 23091 0.000000      =0/23091
Totals  0 43978 0.474942  =20887/43978

Some metrics are close, but the confusion matrices are quite different: the R script predicts all flights as delayed.
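One quick way to confirm the degenerate behaviour is to inspect the predicted class distribution directly. This is a sketch using the standard h2o.predict and h2o.table functions, assuming fit1 and allyears2k.hex from the script above are still in scope (i.e. before h2o.shutdown()):

# Score the training frame and tabulate the predicted classes;
# with the degenerate fit above, every row should come back as "YES"
pred <- h2o.predict(fit1, allyears2k.hex)
print(h2o.table(pred$predict))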

Appendix B: Configuration

Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08

Note: I also tested the R script under 3.19.0.4231 with the same results.

This is the cluster information after running the R:

> h2o.init(max_mem_size = "12g", nthreads = -1)

R is connected to the H2O cluster: 
H2O cluster version:        3.18.0.4 
...
H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
R Version:                  R version 3.3.3 (2017-03-06)
David Leal
    This question is really long and it's going to be really hard for someone to answer. Can you simplify it? Try it without cross-validation. Do a simple test/train split (say 70/30) with `sample.int`. Ensure that the training data you're using is identical. Then check that the model parameterizations are identical. If this is true, then there's possibly something that H2O Flow is doing with grid search that you're not replicating in the R API. – C8H10N4O2 Mar 16 '18 at 13:43
  • 2
    P.S. Something's obviously not right with how you're building the model if you have a fairly balanced dataset with useful predictors but your predicted class never changes. I would simplify the question to: why is my R script returning a constant class prediction? – C8H10N4O2 Mar 16 '18 at 13:45
  • 2
    Fot this to question to be actionable, you should add an R script that runs end-to-end, including the h2o.importFile("http:// ... blah.csv") and the printing of the confusion matrix with all 0's. – TomKraljevic Mar 16 '18 at 14:23
  • I will simplify the question and use exactly the same Airline Delay example from the h2o help (train and validation set). @TomKraljevic I got some error trying to download the file using `h2o.importFile("http:// ... blah.csv")` probably a proxy/security configuration on my computer at work, because the link is valid I have downloaded the file clicking on the link. @C8H10N4O2 I am using `h2o.splitFrame` because I assume is the same split strategy as h2o flow, but I don't know. – David Leal Mar 16 '18 at 15:32
  • @C8H10N4O2 simplified the question and the problem sample, now it works exactly as it is in the Airlines Delay example from h2o flow but using a fixed seed. The problem persists: The solution does not match. I hope with this change you can unmark the question "eligible for bounty". If it is a real issue, I think it is relevant to have a question that shows the problem. – David Leal Mar 16 '18 at 18:39
  • Is your question that you are getting a confusion matrix with all YESes from R, but not from Flow? Or is you question that you get very similar, but slightly different models from R and Flow, even though the seed is the same? Your question currently feels like a mix of both those, and is quite confusing, so can you delete the bits that are no longer relevant? Just leave the minimum information we need to reproduce the problem. – Darren Cook Mar 19 '18 at 21:06

2 Answers


Troubleshooting Tip: build the all-defaults model first:

mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")

This takes 2 seconds and gives almost exactly the same AUC and confusion matrix as your Flow screenshots.

So, we now know the problem you see is due to all the model customization you have done...

...except when I build your fit1 I get basically the same results as my default model:

         NO   YES    Error          Rate
NO     4276 16611 0.795279  =16611/20887
YES    1573 21518 0.068122   =1573/23091
Totals 5849 38129 0.413479  =18184/43978

This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)

Assuming you can get exactly your posted results, running exactly the code you posted (in a fresh R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, while the latest stable release is 3.18.0.2? (My test was with 3.14.0.1.)

Darren Cook
  • @DarreenCook, Now have installed the 3.18 h2o R-package and the result running R-script differs from h2o flow. I followed your suggestion, I ran the script with default parameters and I got different results (even using the same seed). I will post an update to my question for more details. My understanding is that we should be able to reproduce the same experiment (not just similar) when using the same seed in both cases. – David Leal Mar 19 '18 at 17:41
  • @DavidLeal Your question shows a confusion matrix where it has always guessed "yes". If that is not what you are seeing when running the code you posted, then, yes, editing the question (or starting a new question) is a good idea! – Darren Cook Mar 19 '18 at 19:02
  • I understand the question is valid so far because I am not able to reproduce the same result from h2o flow into R-package. I will try to verify this issue using another set of data if that is the case, then I will post another specific question – David Leal Mar 19 '18 at 21:39

Finally, I believe this is the explanation: both runs use the same parameter configuration for building the model (that is not the problem), but h2o Flow applies a parsing customization that converts some columns to Enum, which the R script did not replicate.

The Airlines Delay problem, as specified in the h2o Flow example, uses the following predictor variables (the flow defines the rest as ignored_columns):

"Year", "Month", "DayOfWeek", "UniqueCarrier", 
   "FlightNum", "Origin", "Dest", "Distance"

All of these predictors should be parsed as Enum, except Distance. Therefore the R script needs to convert those columns from numeric or character to factor.
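As an alternative to coercing after import, the types can in principle be forced at parse time. This is only a sketch: it assumes this h2o version supports the list(by.col.name = ..., types = ...) form of the col.types argument to h2o.importFile:

# Hypothetical variant: force the same Enum parsing that the Flow's
# parseFiles setup uses, directly at import time
enum.cols <- c("Year", "Month", "DayofMonth", "DayOfWeek", "UniqueCarrier",
               "FlightNum", "Origin", "Dest", "IsDepDelayed")
allyears2k.hex <- h2o.importFile(
    path = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv",
    destination_frame = "allyears2k.hex",
    col.types = list(by.col.name = enum.cols,
                     types = rep("Enum", length(enum.cols))))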

Executing using h2o R-package

Here the R-Script updated:

library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use all available cores

IS_LOCAL_FILE    = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
    data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
    allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
    airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}

response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)

# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", 
    "ArrTime", "CRSArrTime",
    "TailNum", "ActualElapsedTime", "CRSElapsedTime",
    "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
    "Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
    "IsArrDelayed")

predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek", 
    "UniqueCarrier",  "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work; see the answer to this question:
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
    allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
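# Optional sanity check (assumes h2o.getTypes is available in this h2o
# version): the coerced columns should now report "enum", matching the
# Flow parse setup shown below
print(h2o.getTypes(allyears2k.hex))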

if (IS_DEFAULT_MODEL) {
    fit1 <- h2o.glm(x = predictors, y = response, 
       training_frame = allyears2k.hex,
       family = "binomial", seed = 123456
    )
} else { # Copied and pasted from the flow, then converting to R syntax
    fit1 <- h2o.glm(
        x = predictors,
        model_id = "glm_model", seed = 123456, 
        training_frame = allyears2k.hex,
        ignore_const_cols = T, y = response,
        family = "binomial", solver = "IRLSM",
        alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
        non_negative = F, score_each_iteration = F,
        max_iterations = -1, link = "family_default", intercept = T,
        objective_epsilon = 0.00001,
        beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
        max_active_predictors = -1
    )
}

# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()

Here is the result of running the R script under the default configuration (IS_DEFAULT_MODEL = T):

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:                   0.2001145
RMSE:                  0.4473416
LogLoss:               0.5845852
Mean Per-Class Error:  0.3343562
AUC:                   0.7570867
Gini:                  0.5141734
R^2:                   0.1975266
Residual Deviance:     51417.77
AIC:                   52951.77

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          NO   YES    Error          Rate
NO     10337 10550 0.505099  =10550/20887
YES     3778 19313 0.163614   =3778/23091
Totals 14115 29863 0.325799  =14328/43978

Executing under h2o flow

Now, executing the flow Airlines_Delay_GLMFixedSeed, we obtain the same results. Here are the details of the flow configuration:

The parseFiles function:

parseFiles
  paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
  destination_frame: "allyears2k.hex"
  parse_type: "CSV"
  separator: 44
  number_columns: 31
  single_quotes: false
  column_names: 
  ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
   "CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
   "CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
   "Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
   "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
   "LateAircraftDelay","IsArrDelayed",
   "IsDepDelayed"]
  column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric",
   "Numeric","Numeric", "Enum","Enum","Enum","Numeric",
   "Numeric", "Numeric","Numeric","Numeric",
   "Enum","Enum","Numeric","Numeric","Numeric",
   "Enum","Enum","Numeric","Numeric","Numeric",
   "Numeric","Numeric","Numeric","Enum","Enum"]
  delete_on_done: true
  check_header: 1
  chunk_size: 4194304

where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"

Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed:

 buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
  "training_frame":"allyears2k.hex",
  "ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
  "CRSArrTime","TailNum",
  "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
  "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
  "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
  "LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
  "family":"binomial"}

and finally we get the following result:

[Flow screenshot: confusion matrix for the max-F1 threshold]

and Training Output Metrics:

model                   glm_model-default
model_checksum          -2438376548367921152
frame                   allyears2k.hex
frame_checksum          -2331137066674151424
description             ·
model_category          Binomial
scoring_time            1521598137667
predictions             ·
MSE                     0.200114
RMSE                    0.447342
nobs                    43978
custom_metric_name      ·
custom_metric_value     0
r2                      0.197527
logloss                 0.584585
AUC                     0.757084
Gini                    0.514168
mean_per_class_error    0.334347
residual_deviance       51417.772427
null_deviance           60855.951538
AIC                     52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211

Comparing both results

The training metrics agree to the first four significant digits:

                       R-Script   H2o Flow
MSE:                   0.2001145  0.200114
RMSE:                  0.4473416  0.447342
LogLoss:               0.5845852  0.584585
Mean Per-Class Error:  0.3343562  0.334347
AUC:                   0.7570867  0.757084
Gini:                  0.5141734  0.514168
R^2:                   0.1975266  0.197527
Residual Deviance:     51417.77   51417.772427
AIC:                   52951.77   52951.772427

The confusion matrices are slightly different:

          TP     TN     FP     FN    Error
R-Script  10337  19313  10550  3778  0.325799
H2o Flow  10341  19309  10546  3782  0.3258

My understanding is that the differences are within an acceptable tolerance (around 0.0001), so we can say that both interfaces produce the same result.

David Leal