
I am trying to find reliable hyperparameters for training a multiclass classifier, using both LightGBM's "gbdt" boosting type and scikit-learn's GridSearchCV.

On the feature side there is a ~4k x 40 matrix of continuous values. On the label side there is a pool of 4 mutually exclusive categorical classes.
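
For reference, here is a minimal synthetic stand-in for that data (the real features and labels are not shown; only the shapes and the class count match my description, everything else is made up):

import numpy as np
import pandas as pd

# synthetic stand-in: ~4k observations, 40 continuous features, 4 classes
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(4000, 40)),
                 columns=[f"f{i}" for i in range(40)])
y = pd.Series(rng.integers(0, 4, size=4000), name="label")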

To judge whether any given fold is performing well I would like to use LightGBM's auc_mu metric, but I am OK with any metric at this point. As you can see in the code below, I resorted to balanced accuracy instead.
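
For context, this is roughly the kind of probability-based scorer I had in mind instead; the multiclass roc_auc_score here is only a stand-in for auc_mu, and needs_proba is the pre-1.4 scikit-learn spelling (newer versions use response_method="predict_proba"):

from sklearn.metrics import make_scorer, roc_auc_score

# one-vs-one multiclass AUC as a rough stand-in for LightGBM's auc_mu;
# needs_proba=True makes GridSearchCV score on predict_proba output
auc_ovo_scorer = make_scorer(roc_auc_score, needs_proba=True,
                             multi_class='ovo', average='macro')

It could then be passed as scoring=auc_ovo_scorer, but for now the grid search below just uses "balanced_accuracy".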

Below is a simplified version of how the gridsearch is initialised.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# hyperparameter grid to search over
param_set = {
    'n_estimators': [15, 25]
}
# base LightGBM estimator passed to the grid search
clf = lgb.LGBMModel(
    boosting_type='gbdt',
    num_leaves=31,
    max_depth=5,
    learning_rate=0.1,
    n_estimators=100,
    objective='multiclass',
    num_class=len(np.unique(training_data.label)),
    min_split_gain=0,
    min_child_weight=1e-3,
    min_child_samples=10,
    subsample=1,
    subsample_freq=0,
    colsample_bytree=0.6,
    reg_alpha=0.3,
    reg_lambda=0.7,
    random_state=42,
    n_jobs=2)
# 5-fold grid search, scored with balanced accuracy
gsearch = GridSearchCV(estimator=clf,
    param_grid=param_set,
    scoring="balanced_accuracy",
    error_score='raise',
    n_jobs=2,
    cv=5,
    verbose=2)

When I try to call the fit function on the GridSearchCV object,

# separate total data into train/validation and test
stratifiedss = StratifiedShuffleSplit(
    n_splits=1, test_size=0.2, train_size=0.8, random_state=723)

for train_ind, test_ind in stratifiedss.split(X,y):
    train_feature_obs = X.loc[train_ind]
    train_labels = y[train_ind]
    validation_feature_obs = X.loc[test_ind]
    validation_labels = y[test_ind]

# transform data into lgb Dataset
training_data = lgb.Dataset(train_feature_obs, label=train_labels)

# call the GridSearchCV.fit
lgb_model2 = gsearch.fit(training_data.data.reset_index(drop=True), training_data.label)

it returns

ValueError: Classification metrics can't handle a mix of unknown and continuous-multioutput targets

So I am guessing that sklearn's GridSearchCV has trouble evaluating the output of LGBMModel.predict().

I tried fitting an LGBMModel separately, and it returns an array with the probability of each observation belonging to each of the four classes, summing to 1.
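
This is roughly how I checked it, using the train/validation split from above (a sketch; the point is that predict() yields a 2-D probability array rather than class labels):

# fit the bare LGBMModel on the training split and inspect its predictions
clf.fit(train_feature_obs, train_labels)
preds = clf.predict(validation_feature_obs)
print(preds.shape)             # (n_validation_rows, 4)
print(preds[:3].sum(axis=1))   # each row sums to ~1.0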

I looked at:

But that has not been conclusive yet.

How can I enable sklearn's GridSearchCV to evaluate the performance of each fold of the LGBMModel classifier? I am mostly confused as to where the "unknown" type is coming from.

Any help would be much appreciated.

Regards, Robert
