tl;dr What setting in either R's ranger or h2o's h2o.randomForest can account for the very different performance on the exact same data?
Background:
I'm trying to classify a strongly imbalanced dataset, and the measure of goodness under consideration is Cohen's kappa (from caret). I have about 70k rows and about 400 columns; about 99.3% of the rows are category "0" and about 0.7% are category "1".
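For scoring, each run ends up as a kappa pulled from caret's confusion matrix, roughly like this (a minimal sketch; pred and truth are placeholder names for the predicted and actual test-set labels):

library(caret)

# kappa from caret's confusionMatrix; "1" is the rare positive class
cm <- confusionMatrix(data = pred, reference = truth, positive = "1")
cm$overall["Kappa"]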
Here is a snip of the ranger inputs:
library(ranger)
library(magrittr)   # for %>%

est_ranger <- ranger(y ~ ., data = df,
                     num.trees = 100,
                     max.depth = 20,
                     min.node.size = 5,
                     mtry = sqrt(ncol(df)) %>% round(),   # roughly sqrt(p) columns tried per split
                     splitrule = "gini",
                     sample.fraction = 0.632)
Here is a snip of the h2o.ai randomForest inputs:
library(h2o)
h2o.init()

# df1.hex is the H2OFrame copy of df1 (e.g. from as.h2o); column 1 is the response y
est_h2o <- h2o.randomForest(x = 2:ncol(df1), y = 1,
                            training_frame = "df1.hex",
                            ntrees = 100,
                            max_depth = 20,
                            min_rows = 5,
                            sample_rate = 0.632,
                            mtries = sqrt(ncol(df1)) %>% round())
Note: I tried setting both of them to max depth 12 and it didn't help. Setting both to max depth 20 didn't change things either, and neither did leaving max depth unrestricted.
When I run 10 loops of train-predict-evaluate, these are the kappa values for ranger:
> summary(perf_ranger)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2134 0.2261 0.2458 0.2410 0.2564 0.2633
And these are the kappa values for h2o.randomForest:
> summary(perf_h2o)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5408 0.5575 0.6264 0.6182 0.6727 0.6922
To my eye, the mean kappa for h2o.randomForest is around 2.56x that of ranger.
Question: What is h2o doing that ranger isn't?
Thoughts:
- there may be dynamic learning rate elements in h2o.ai
- there may be something about the "histogram" and "bins" machinery in h2o's split finding (see the sketch after this list)
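If the histogram/bins idea matters, one way to probe it would be to turn h2o's binning knobs directly (a hedged sketch; the values below are arbitrary, not tuned):

# coarser or finer split histograms should move kappa if binning is the driver
est_h2o_bins <- h2o.randomForest(x = 2:ncol(df1), y = 1,
                                 training_frame = "df1.hex",
                                 ntrees = 100, max_depth = 20,
                                 min_rows = 5, sample_rate = 0.632,
                                 nbins = 64,                        # default is 20
                                 histogram_type = "UniformAdaptive")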
Update (23-Sept):
- tried using PAA on the ECDF domain to artificially constrict the histograms, and that substantially reduced kappa for ranger. The conclusion there is that removing diversity in the columns hurts performance.
- tried forcing balanced classes (which some stats folks say is bad) and the kappa got much better for both of them (see below; the h2o call is sketched after those results). Also changed min rows to 1.
Here it is for ranger:
summary(store1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5113 0.5192 0.5252 0.5262 0.5299 0.5494
Here it is for h2o.ai:
summary(store2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9377 0.9512 0.9571 0.9550 0.9595 0.9662
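For reference, the h2o side of the balanced run was roughly this call (a minimal sketch; only the changed arguments differ from the earlier snip):

# balance_classes resamples so the classes are balanced during training
est_h2o_bal <- h2o.randomForest(x = 2:ncol(df1), y = 1,
                                training_frame = "df1.hex",
                                ntrees = 100, max_depth = 20,
                                min_rows = 1,                        # was 5
                                sample_rate = 0.632,
                                mtries = sqrt(ncol(df1)) %>% round(),
                                balance_classes = TRUE)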
The difference in mean kappa is 0.377 on the imbalanced data and 0.428 with balanced classes. There is still a gap, but training on resampled data gives better test-set performance for both libraries.
Ranger has two ways to balance classes: one by resampling and one by "weight", which I think (wildly guessing) affects how the best split is computed.
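A minimal sketch of how I read those two options in ranger (the weight values are illustrative, not tuned):

ratio <- sum(df$y == "0") / sum(df$y == "1")   # roughly 99.3 / 0.7, about 142

# (1) resampling-driven balance: rare-class rows are drawn more often into
#     each tree's bootstrap sample via per-observation case.weights
est_resamp <- ranger(y ~ ., data = df,
                     num.trees = 100, min.node.size = 1, replace = TRUE,
                     case.weights = ifelse(df$y == "1", ratio, 1))

# (2) weight-driven balance: class.weights enter the splitting criterion
#     (cost-sensitive splits), in the order of the factor levels of y
est_weight <- ranger(y ~ ., data = df,
                     num.trees = 100, min.node.size = 1,
                     class.weights = c(1, ratio))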
Here is what ranger gives for weighting-driven balance of classes:
summary(store1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3491 0.3896 0.4051 0.4079 0.4381 0.4520
Here is what it gives for resampling-driven balance of classes:
summary(store1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5170 0.5239 0.5310 0.5332 0.5425 0.5559
Here is what I get when they both are used:
summary(store1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5113 0.5212 0.5275 0.5295 0.5393 0.5544
The first two ranges don't overlap, and resampling is clearly better. Using both together gives a very slight (potentially negligible) decrease compared to resampling alone, so without a grid search and fine-tuning it seems better to go with resampling-based balancing.
When I try splitrule = "extratrees" instead of "gini" (a split rule h2o does not offer; it picks candidate split points at random rather than searching exhaustively), kappa goes up substantially:
summary(store1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6267 0.6431 0.6500 0.6473 0.6535 0.6578
That is the best that I have at this point, but it is still speculation.
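For completeness, the extratrees run was roughly this call (a minimal sketch; I'm assuming the resampling-based balancing from above stays in place and that num.random.splits is left at its default of 1):

# "extratrees": candidate split points are drawn at random per tried variable
est_extra <- ranger(y ~ ., data = df,
                    num.trees = 100, min.node.size = 1, replace = TRUE,
                    case.weights = ifelse(df$y == "1", ratio, 1),   # ratio as defined above
                    splitrule = "extratrees",
                    num.random.splits = 1)   # default; larger values are less random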