I am trying to optimize the hyperparameters of my XGBoost ranker model (XGBRanker), but I can't get the search to run.

Here is what my table (df in the code) looks like:

query  relevance  features
1      5          5.4.7....
1      3          6........
2      5          3........
2      3          8........
3      2          1........

Then I split the table into train and test sets, with only one query in the test set:

from sklearn.model_selection import GroupShuffleSplit

# test_size=1 (an int) holds out exactly one query group for the test set
gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)

train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance

test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance

and build groups, which holds the number of rows per query:

groups = train_data.groupby('query').size().to_numpy()

And then I run my model and try to optimize the hyperparameters with a RandomizedSearchCV:

import sklearn.metrics
import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }

scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)

model = xgb.XGBRanker(
    tree_method='hist',
    booster='gbtree',
    objective='rank:ndcg')
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
# groups holds one size per query in X_train
clf.fit(X_train, Y_train, group=groups)

Then I get the following error message, which seems to be related to how I construct the groups, but I don't see why (the model works without the randomized search):

Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
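
The two numbers in the message are consistent with a 5-fold split of the training rows. If X_train has 11544 rows (the sum of groups), an unshuffled K-fold split with cv=5 would train on 9235 of them in four of the five folds, while group still describes all 11544. A minimal sketch of that arithmetic, assuming 11544 is the training row count; kfold_train_sizes is a hypothetical helper that mirrors how K-fold sizes its folds, not a library function:

```python
def kfold_train_sizes(n_samples, n_splits):
    """Training-set size for each fold of an unshuffled K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` validation folds receive one extra row each.
    val_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_samples - v for v in val_sizes]


# Assumption: 11544 is the sum of `groups`, i.e. the number of rows in X_train.
n_rows_from_groups = 11544
train_sizes = kfold_train_sizes(n_rows_from_groups, n_splits=5)
print(train_sizes)  # [9235, 9235, 9235, 9235, 9236]
```

Under that assumption, each fold receives a row subset of X_train but the unchanged group array, which would produce exactly the "11544 vs. 9235" mismatch.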

This looks like the same problem as the question "Tuning XGBRanker produces error for groups".

Hcoic
  • It seems to me the problem is that the train test split required by `XGBRanker` is incompatible with the default CV method in `RandomizedSearchCV`. Note that you have to use `GroupShuffleSplit` and provide `group` values to `XGBRanker`, yet neither of these are available in `RandomizedSearchCV` without further customization. – Fanchen Bao Jan 18 '23 at 21:12
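
One way to act on this comment is to run the search loop by hand: split by query so that no query is cut in half, and recompute the group sizes from each fold's training rows before fitting. Below is a minimal sketch of the fold construction using only the standard library. group_folds is a hypothetical helper (not part of scikit-learn or XGBoost), the round-robin assignment of queries to folds is an arbitrary choice, and the fit call is only indicated in a comment. scikit-learn's GroupKFold provides a comparable query-level split.

```python
from itertools import groupby


def group_folds(query_ids, n_splits):
    """Yield (train_idx, val_idx, train_group_sizes) for each fold.

    Whole queries are assigned to folds round-robin, so no query is split.
    Assumes rows are sorted so that each query's rows are contiguous.
    """
    unique_queries = list(dict.fromkeys(query_ids))  # order of first appearance
    fold_of = {q: i % n_splits for i, q in enumerate(unique_queries)}
    for fold in range(n_splits):
        train_idx = [i for i, q in enumerate(query_ids) if fold_of[q] != fold]
        val_idx = [i for i, q in enumerate(query_ids) if fold_of[q] == fold]
        # Group sizes are recomputed from this fold's training rows only.
        train_groups = [len(list(rows))
                        for _, rows in groupby(query_ids[i] for i in train_idx)]
        yield train_idx, val_idx, train_groups


# Toy data matching the table in the question: queries 1, 1, 2, 2, 3.
query_ids = [1, 1, 2, 2, 3]
folds = list(group_folds(query_ids, n_splits=3))
for train_idx, val_idx, train_groups in folds:
    # Here one would fit on the training rows of this fold, e.g.
    # model.fit(X.iloc[train_idx], y.iloc[train_idx], group=train_groups)
    assert sum(train_groups) == len(train_idx)
    print(val_idx, train_groups)
```

The invariant checked in the loop, that the group sizes sum to the number of training rows, is the same condition the XGBoost error reports as violated.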
