
This is my first time using Stack Exchange, but I need help with a problem (it is not a homework or assignment problem):

I have two decision trees: D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1) and D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1). When I performed 5-fold cross-validation on both of them for a given set of features and corresponding labels, their average validation accuracies over the 5 folds were 0.59 and 0.57, respectively. How do I determine whether the difference between their performances is statistically significant? (P.S. we are to use a significance level of 0.01.)

Please say if any important information or term is missing here.


1 Answer

This is a very good question, and the answer proves to be not that straightforward.

Instinctively, most people would tend to recommend a Student's paired t-test; but, as explained in the excellent Machine Learning Mastery post Statistical Significance Tests for Comparing Machine Learning Algorithms, this test is not actually suited to this case, as its assumptions are in fact violated:

In fact, this [Student's t-test] is a common way to compare classifiers with perhaps hundreds of published papers using this methodology.

The problem is, a key assumption of the paired Student’s t-test has been violated.

Namely, the observations in each sample are not independent. As part of the k-fold cross-validation procedure, a given observation will be used in the training dataset (k-1) times. This means that the estimated skill scores are dependent, not independent, and in turn that the calculation of the t-statistic in the test will be misleadingly wrong along with any interpretations of the statistic and p-value.

The article goes on to recommend McNemar's test (see also this, now closed, SO question), which is implemented in the statsmodels Python package. I will not pretend to know anything about it and I have never used it, so you might need to do some further digging by yourself here...
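
For reference, below is a minimal, untested sketch of how such a test might be run with statsmodels; the toy label/prediction arrays and the construction of the 2x2 contingency table are my own assumptions for illustration only, so double-check them against the statsmodels documentation before relying on them:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# y_true, pred_a, pred_b stand in for the labels and the predictions of the
# two classifiers on the *same* held-out data (hypothetical toy arrays here)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
pred_a = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 1])
pred_b = np.array([0, 1, 1, 0, 0, 0, 0, 1, 1, 1])

correct_a = pred_a == y_true
correct_b = pred_b == y_true

# 2x2 contingency table: rows = A correct / A wrong, columns = B correct / B wrong
table = np.array([
    [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])

result = mcnemar(table, exact=True)  # exact binomial version, suited to small counts
print(result.statistic, result.pvalue)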

Nevertheless, as reported by the aforementioned post, a Student's t-test can be a "last resort" approach:

It’s an option, but it’s very weakly recommended.

and this is what I am going to demonstrate here; use it with caution.

To start with, you will need not only the averages, but the actual values of your performance metric in each one of the k folds of your cross-validation. This is not exactly trivial in scikit-learn, but I have recently answered a relevant question on Cross-validation metrics in scikit-learn for each data split, and I will adapt that answer here using scikit-learn's Boston dataset and two decision tree regressors (you can certainly adapt this to your own exact case):

from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# note: in recent scikit-learn versions load_boston has been removed and
# criterion='mae' has been renamed to 'absolute_error'; adapt accordingly
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)

# two regressors differing only in max_depth, analogous to your two classifiers
model_1 = DecisionTreeRegressor(max_depth=4, criterion='mae', random_state=1)
model_2 = DecisionTreeRegressor(max_depth=8, criterion='mae', random_state=1)

cv_mae_1 = []
cv_mae_2 = []

# fit both models on exactly the same folds, so the per-fold errors are paired
for train_index, val_index in kf.split(X):
    model_1.fit(X[train_index], y[train_index])
    pred_1 = model_1.predict(X[val_index])
    err_1 = mean_absolute_error(y[val_index], pred_1)
    cv_mae_1.append(err_1)

    model_2.fit(X[train_index], y[train_index])
    pred_2 = model_2.predict(X[val_index])
    err_2 = mean_absolute_error(y[val_index], pred_2)
    cv_mae_2.append(err_2)

cv_mae_1 contains the values of our metric (here mean absolute error - MAE) for each of the 5 folds of our 1st model:

cv_mae_1
# result:
[3.080392156862745,
 2.8262376237623767,
 3.164851485148514,
 3.5514851485148515,
 3.162376237623762] 

and similarly cv_mae_2 for our 2nd model:

cv_mae_2
# result
[3.1460784313725494,
 3.288613861386139,
 3.462871287128713,
 3.143069306930693,
 3.2490099009900986]

Having obtained these lists, it is now straightforward to calculate the paired t-test statistic along with the corresponding p-value, using scipy's ttest_rel method:

from scipy import stats
stats.ttest_rel(cv_mae_1, cv_mae_2)
# Ttest_relResult(statistic=-0.6875659723031529, pvalue=0.5295196273427171)

where, in our case, the large p-value (far above your significance level of 0.01) means that we cannot conclude there is a statistically significant difference between the means of our MAE metrics.
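
Finally, since your own case is a classification one, here is a rough sketch of how you might adapt the above to your two DecisionTreeClassifier models and the 0.01 significance level you mention; the synthetic dataset from make_classification is purely an illustrative stand-in, so substitute your own features and labels:

from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for your features and labels (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

D1 = DecisionTreeClassifier(max_depth=4, criterion='entropy', random_state=1)
D2 = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# identical folds for both models, so the per-fold accuracies are paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
acc_1 = cross_val_score(D1, X, y, cv=cv, scoring='accuracy')
acc_2 = cross_val_score(D2, X, y, cv=cv, scoring='accuracy')

res = stats.ttest_rel(acc_1, acc_2)
alpha = 0.01
if res.pvalue < alpha:
    print(f"p = {res.pvalue:.4f} < {alpha}: the difference is statistically significant")
else:
    print(f"p = {res.pvalue:.4f} >= {alpha}: no statistically significant difference")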

Hope this helps - do not hesitate to dig deeper by yourself...

  • Regrets for the delay in response, but your post was incredibly helpful. I used the paired t-test statistic to ascertain the statistical significance of the difference between the means of my models, and so far it has proven to be quite effective. – gpradhan Feb 04 '19 at 14:29
  • @desertnaut thanks for the links and answer. A follow-up question: what if I have more than 2 models to compare? I actually have 4 models and I want to see which models are significantly better/worse than the others. In this case, should I repeat the same experiment for each possible pair (i.e. compare model1-model2, model1-model3, model1-model4, model2-model3, and so on)? – zwlayer Mar 02 '20 at 09:17