I am trying to tune the hyperparameters of my XGBoost Ranker model, but I can't get it to work.
Here is what my table (`df` in the code) looks like:
query | relevance | features |
---|---|---|
1 | 5 | 5.4.7.... |
1 | 3 | 6........ |
2 | 5 | 3........ |
2 | 3 | 8........ |
3 | 2 | 1........ |
Then I split the table into train and test sets, with only one query in the test set:
from sklearn.model_selection import GroupShuffleSplit
gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)
train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance
test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance
Next I build `groups`, which holds the number of rows per query:
groups = train_data.groupby('query').size().to_numpy()
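To make the intended group structure concrete, here is a minimal sketch on a toy frame shaped like the table above. The name `toy`, the `feature_0` column, and its values are made up for illustration and are not my real data. It only shows the invariant the ranker expects: the group sizes, in row order, must sum to the number of rows passed to `fit`.

```python
import pandas as pd

# Hypothetical toy frame mirroring the table layout; feature values are invented.
toy = pd.DataFrame({
    "query": [1, 1, 2, 2, 3],
    "relevance": [5, 3, 5, 3, 2],
    "feature_0": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Keep the rows of each query contiguous so the sizes line up with the data.
toy = toy.sort_values("query", kind="stable")

# One entry per query, in order of appearance: the number of rows it owns.
group_sizes = toy.groupby("query", sort=False).size().to_numpy()
print(group_sizes)  # [2 2 1]

# The group sizes must add up to the number of rows.
assert group_sizes.sum() == len(toy)
```

On my full training set this check passes, which is why fitting the model directly works.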
Then I fit the model and try to tune the hyperparameters with `RandomizedSearchCV`:
import sklearn.metrics
import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }
scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)
model = xgb.XGBRanker(
    tree_method='hist',
    booster='gbtree',
    objective='rank:ndcg',)
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train, group=groups)
I then get the following error message, which seems to be related to how I construct the groups, but I don't see why (the model works fine without the randomized search):
Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
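One lead I noticed when checking the two numbers in the message, offered as an assumption rather than a confirmed diagnosis: 9235 is almost exactly four fifths of 11544, which is what a 5-fold split would leave for training. The sketch below reproduces that arithmetic in plain Python, using the rule documented for scikit-learn's unshuffled `KFold` (the first `n_samples % n_splits` folds get one extra row). The variable names are mine.

```python
n_rows = 11544    # total rows implied by the group sizes in the error message
n_splits = 5      # matches cv=5 in the search

# Validation fold sizes: the first (n_rows % n_splits) folds get one extra row.
fold_sizes = [
    n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
    for i in range(n_splits)
]

# Rows left for training in each fold.
train_sizes = [n_rows - size for size in fold_sizes]
print(train_sizes)  # [9235, 9235, 9235, 9235, 9236]
```

If that reading is right, each fold would be fitted on about 9235 rows while still receiving the full `groups` array that sums to 11544, which would match the message. I have not verified this against the library source.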
This looks like the same problem as described here: "Tuning XGBRanker produces error for groups".