I am working on text classification, and after the feature extraction step I ended up with fairly large matrices. For that reason I tried to use incremental learning, as follows:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score as acc


def incremental_learning2(X, y):
    # split data into training and testing sets,
    # then split the training set in half
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=0)
    X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(
        X_train, y_train, test_size=0.5, random_state=0)

    xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
    xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
    xg_test = xgb.DMatrix(X_test, label=y_test)

    # params = {'objective': 'reg:linear', 'verbose': False}
    params = {}

    model_1 = xgb.train(params, xg_train_1, 30)
    model_1.save_model('model_1.model')

    # ================= train two versions of the model =====================
    model_2_v1 = xgb.train(params, xg_train_2, 30)
    model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

    # predictions (Booster.predict expects a DMatrix, not a raw array)
    y_pred = model_2_v2.predict(xg_test)

    # cross validation -- this is the part that fails: the xgb.Booster
    # returned by xgb.train() has no fit() or score() methods (those
    # belong to the sklearn wrapper, XGBClassifier)
    kfold = StratifiedKFold(n_splits=10, shuffle=True,
                            random_state=1).split(X_train, y_train)
    scores = []
    for k, (train, test) in enumerate(kfold):
        model_2_v2.fit(X_train[train], y_train[train])
        score = model_2_v2.score(X_train[test], y_train[test])
        scores.append(score)
        print('Fold: %s, Class dist.: %s, Acc: %.3f' %
              (k + 1, np.bincount(y_train[train]), score))
    print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
Regarding the above code: I tried to do cross validation and to predict some instances, but it is not working. How can I fix the code so that I get cross-validated metrics and predictions after fitting and updating the GBM model on a very large dataset?