This is a very good question, and the answer proves to be not that straightforward.
Instinctively, most people would tend to recommend a Student's paired t-test; but, as explained in the excellent Machine Learning Mastery post Statistical Significance Tests for Comparing Machine Learning Algorithms, this test is not actually suited to this case, as its assumptions are in fact violated:
In fact, this [Student's t-test] is a common way to compare
classifiers with perhaps hundreds of published papers using this
methodology.
The problem is, a key assumption of the paired Student’s t-test has
been violated.
Namely, the observations in each sample are not independent. As part
of the k-fold cross-validation procedure, a given observation will be
used in the training dataset (k-1) times. This means that the
estimated skill scores are dependent, not independent, and in turn
that the calculation of the t-statistic in the test will be
misleadingly wrong along with any interpretations of the statistic and
p-value.
The article goes on to recommend McNemar's test (see also this, now closed, SO question), which is implemented in the statsmodels Python package. I will not pretend to know anything about it and I have never used it, so you may need to do some further digging yourself here...
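For completeness, here is a minimal sketch of what such a McNemar test could look like with statsmodels, assuming two classifiers evaluated on the same held-out set (the per-sample correctness arrays and their values below are purely illustrative):
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
# hypothetical boolean arrays: whether each classifier got each test sample right
correct_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
correct_2 = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)
# 2x2 contingency table of (dis)agreements between the two classifiers
table = [[np.sum( correct_1 &  correct_2), np.sum( correct_1 & ~correct_2)],
         [np.sum(~correct_1 &  correct_2), np.sum(~correct_1 & ~correct_2)]]
res = mcnemar(table, exact=True)  # exact binomial version, suited to small counts
print(res.statistic, res.pvalue)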
Nevertheless, as reported by the aforementioned post, a Student's t-test can be a "last resort" approach:
It’s an option, but it’s very weakly recommended.
and this is what I am going to demonstrate here; use it with caution.
To start with, you will need not only the averages, but the actual values of your performance metric in each one of the k folds of your cross-validation. This is not exactly trivial in scikit-learn, but I have recently answered a relevant question on Cross-validation metrics in scikit-learn for each data split, and I will adapt that answer here using scikit-learn's Boston dataset and two decision tree regressors (you can certainly adapt these to your own exact case):
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
model_1 = DecisionTreeRegressor(max_depth=4, criterion='mae', random_state=1)
model_2 = DecisionTreeRegressor(max_depth=8, criterion='mae', random_state=1)
cv_mae_1 = []
cv_mae_2 = []
for train_index, val_index in kf.split(X):
    # fit & evaluate the 1st model on the current fold
    model_1.fit(X[train_index], y[train_index])
    pred_1 = model_1.predict(X[val_index])
    err_1 = mean_absolute_error(y[val_index], pred_1)
    cv_mae_1.append(err_1)
    # fit & evaluate the 2nd model on the very same fold (paired observations)
    model_2.fit(X[train_index], y[train_index])
    pred_2 = model_2.predict(X[val_index])
    err_2 = mean_absolute_error(y[val_index], pred_2)
    cv_mae_2.append(err_2)
cv_mae_1 contains the values of our metric (here mean absolute error - MAE) for each of the 5 folds of our 1st model:
cv_mae_1
# result:
[3.080392156862745,
2.8262376237623767,
3.164851485148514,
3.5514851485148515,
3.162376237623762]
and similarly cv_mae_2 for our 2nd model:
cv_mae_2
# result
[3.1460784313725494,
3.288613861386139,
3.462871287128713,
3.143069306930693,
3.2490099009900986]
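As a side note, and provided the KFold splitter is given a fixed random_state so that both models see exactly the same folds, these per-fold values could arguably also be obtained more compactly with cross_val_score - a rough sketch:
from sklearn.model_selection import cross_val_score
kf_fixed = KFold(n_splits=n_splits, shuffle=True, random_state=0)  # fixed seed => identical folds for both models
# scikit-learn reports the *negative* MAE for scoring='neg_mean_absolute_error', hence the minus signs
cv_mae_1_alt = -cross_val_score(model_1, X, y, scoring='neg_mean_absolute_error', cv=kf_fixed)
cv_mae_2_alt = -cross_val_score(model_2, X, y, scoring='neg_mean_absolute_error', cv=kf_fixed)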
Having obtained these lists, it is now straightforward to calculate the paired t-test statistic along with the corresponding p-value, using the ttest_rel method of scipy:
from scipy import stats
stats.ttest_rel(cv_mae_1, cv_mae_2)
# Ttest_relResult(statistic=-0.6875659723031529, pvalue=0.5295196273427171)
where, in our case, the large p-value (~0.53) means that we cannot conclude there is a statistically significant difference between the means of our MAE metrics.
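For the curious, the same numbers can be reproduced by hand from the fold-wise differences, which is essentially all that ttest_rel does under the hood:
import numpy as np
d = np.array(cv_mae_1) - np.array(cv_mae_2)            # per-fold (paired) differences
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # paired t-statistic
p_val = 2 * stats.t.sf(abs(t_stat), df=len(d) - 1)     # two-sided p-value with k-1 degrees of freedom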
Hope this helps - do not hesitate to dig deeper by yourself...