
Does anyone know whether one can use AdaBoost with random forest base classifiers? I searched the web to learn more about this, but most pages compare the two as ensemble learning methods, and I didn't find anything about using them together.

(Basically somebody asked this here, but the only answer so far contradicts my observations, which I'm sharing below.)

Still, nobody explicitly said there was anything wrong with it, so I tried it on a typical dataset with n rows of p real-valued features and a label list of length n. In case it matters, the features are embeddings of graph nodes obtained by the DeepWalk algorithm, and the nodes fall into two classes. I trained several classification models on this data using 5-fold cross-validation and measured common evaluation metrics for each (precision, recall, AUC, etc.). The models were SVM, logistic regression, random forest, a 2-layer perceptron, and AdaBoost with random forest classifiers. The last one, AdaBoost with random forest classifiers, yielded the best results. Sure, the runtime increased by a factor of roughly 100, but it's still about 20 minutes, so that's not a constraint for me. Now I wonder whether I should be suspicious of its good accuracy (95% AUC, compared to the multilayer perceptron's 89% and random forest's 88%).
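For concreteness, here is a minimal sketch of the kind of setup I mean, assuming scikit-learn (the synthetic data and hyperparameters below are placeholders, not my actual embeddings or settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the DeepWalk node embeddings:
# n rows of p real-valued features, binary labels.
X, y = make_classification(n_samples=1000, n_features=64, random_state=42)

# AdaBoost with small random forests as the base estimator.
# On scikit-learn < 1.2 the keyword is base_estimator= instead of estimator=.
model = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=10, max_depth=3),
    n_estimators=50,
    random_state=42,
)

# 5-fold cross-validated AUC, as described above.
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(auc.mean())
```

scikit-learn accepts this combination without complaint, since AdaBoost only requires a base estimator that supports sample weights, which RandomForestClassifier does.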

Here's what I thought: firstly, I'm using cross-validation, so there's probably no overfitting flying under the radar. Secondly, both are ensemble learning methods, but random forest is a bagging method, whereas AdaBoost is a boosting technique, so they're still different enough for their combination to make sense.

Am I wrong?

Esi
  • "*nobody explicitly said there was anything wrong with it*" - that's not true, I have explicitly explain in the linked answer that, at least **in theory**, Adaboost should be applied to *unstable* classifiers, which RF is not. Now, keep in mind that ML is largely an *empirical*, theory-poor domain, and the fact that in your case, with the specific data, you get better results is not a contradiction. In any case, this is not a *programming* question - see the NOTE here for alternative places to ask it: https://stackoverflow.com/tags/machine-learning/info – desertnaut Jun 06 '21 at 12:35
  • Notice also that AUC is *not* accuracy, nor does it measure the performance of a *single* final model (it measures the performance of a model averaged across the range of all possible *classification thresholds*); see https://stackoverflow.com/a/58612125/4685471 – desertnaut Jun 06 '21 at 12:38
  • I'm aware that AUC is not accuracy, but it was what we were mainly interested in (because it's a better metric for imbalanced data). Let me add the accuracy numbers here: AdaBoost with RF: 90.5%, RF: 88.5%, multilayer perceptron: 86% (averaged over 5 folds) – Esi Jun 06 '21 at 12:58
  • Arguably, for imbalanced data you should check the F1 score (accuracy is indeed meaningless). The issues mentioned with AUC still hold. – desertnaut Jun 06 '21 at 13:06

0 Answers