k-fold CV doesn't make the model more accurate per se. In your example with xgb, there are many hyperparameters to specify (e.g. subsample, eta), and to get a sense of how the chosen parameters perform on unseen data, we use k-fold CV to partition the data into several training and test splits and measure out-of-sample accuracy.
We usually try this for several candidate values of each parameter and keep whatever gives the lowest average error; after that, you refit the model on the full training set with those parameters. This post and its answers discuss it.
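To make the partitioning concrete, this is roughly what a k-fold split looks like using sklearn's KFold (xgb.cv does the equivalent internally when you pass nfold; the toy array here is just for illustration):

from sklearn.model_selection import KFold
import numpy as np

# each of the 5 folds holds out a different fifth of the data as the test set
X_demo = np.arange(10)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X_demo)):
    print(fold, 'train:', train_idx, 'test:', test_idx)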
For example, below we run something like what you did and get the train/test error for one set of parameter values:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# simulate a binary classification dataset
X, y = make_classification(n_samples=500, class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33, random_state=42)

data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)

params = {'objective': 'binary:logistic',
          'eval_metric': 'logloss',
          'eta': 0.01,
          'subsample': 0.1}

# 5-fold CV; runs the default 10 boosting rounds and returns a DataFrame
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                metrics='logloss', seed=42)
train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.689600 0.000517 0.689820 0.001009
1 0.686462 0.001612 0.687151 0.002089
2 0.683626 0.001438 0.684667 0.003009
3 0.680450 0.001100 0.681929 0.003604
4 0.678269 0.001399 0.680310 0.002781
5 0.675170 0.001867 0.677254 0.003086
6 0.672349 0.002483 0.674432 0.004349
7 0.668964 0.002484 0.671493 0.004579
8 0.666361 0.002831 0.668978 0.004200
9 0.663682 0.003881 0.666744 0.003598
The last row is the result from the last boosting round, which is what we use for evaluation.
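If you want that row programmatically, xgb.cv returns a pandas DataFrame, so you can take the last row directly, e.g.:

# mean test logloss of the final boosting round
print(xgb_cv.iloc[-1]['test-logloss-mean'])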
If we test over multiple values of eta and subsample, for example:
grid = pd.DataFrame({'eta':[0.01,0.05,0.1]*2,
'subsample':np.repeat([0.1,0.3],3)})
eta subsample
0 0.01 0.1
1 0.05 0.1
2 0.10 0.1
3 0.01 0.3
4 0.05 0.3
5 0.10 0.3
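As an aside, here is a minimal GridSearchCV sketch of the same search; it assumes the X_train/y_train split from above and uses the sklearn wrapper XGBClassifier, where learning_rate is the sklearn-API name for eta:

from sklearn.model_selection import GridSearchCV

# same grid, searched with sklearn's wrapper instead of xgb.cv
search = GridSearchCV(xgb.XGBClassifier(eval_metric='logloss'),
                      param_grid={'learning_rate': [0.01, 0.05, 0.1],
                                  'subsample': [0.1, 0.3]},
                      scoring='neg_log_loss', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)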
Normally we could use GridSearchCV for this, as sketched above, but below is an approach that drives xgb.cv directly:
def fit(x):
    # run 5-fold CV for one (eta, subsample) combination
    params = {'objective': 'binary:logistic',
              'eval_metric': 'logloss',
              'eta': x['eta'],
              'subsample': x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics='logloss', seed=42)
    # keep only the metrics from the last boosting round
    return xgb_cv.iloc[-1].values
grid[['train-logloss-mean','train-logloss-std',
'test-logloss-mean','test-logloss-std']] = grid.apply(fit,axis=1,result_type='expand')
eta subsample train-logloss-mean train-logloss-std test-logloss-mean test-logloss-std
0 0.01 0.1 0.663682 0.003881 0.666744 0.003598
1 0.05 0.1 0.570629 0.012555 0.580309 0.023561
2 0.10 0.1 0.503440 0.017761 0.526891 0.031659
3 0.01 0.3 0.646587 0.002063 0.653741 0.004201
4 0.05 0.3 0.512229 0.008013 0.545113 0.018700
5 0.10 0.3 0.414103 0.012427 0.472379 0.032606
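You can also pick the winner programmatically instead of eyeballing the table, e.g.:

# row of the grid with the lowest mean test logloss
best = grid.loc[grid['test-logloss-mean'].idxmin()]
print(best[['eta', 'subsample']])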
We can see that eta = 0.10 and subsample = 0.3 give the best result (lowest test logloss), so next you just need to refit the model with these parameters:
# this is a classification problem, so use XGBClassifier (not XGBRegressor)
xgb_clf = xgb.XGBClassifier(objective='binary:logistic',
                            eval_metric='logloss',
                            eta=0.1,
                            subsample=0.3)
xgb_clf.fit(X_train, y_train)
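Finally, as a sanity check, you can score the refit model on the held-out test split (a quick sketch using sklearn.metrics):

from sklearn.metrics import accuracy_score, log_loss

# evaluate the refit model on data it never saw during tuning
proba = xgb_clf.predict_proba(X_test)[:, 1]
print('test logloss:', log_loss(y_test, proba))
print('test accuracy:', accuracy_score(y_test, xgb_clf.predict(X_test)))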