I am trying to find the best parameters for a lightgbm model using GridSearchCV from sklearn.model_selection. I have not been able to find a solution that actually works.

I have managed to put together some partly working code:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

np.random.seed(1)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = pd.read_csv('y.csv')
y = y.values.ravel()
print(train.shape, test.shape, y.shape)

categoricals = ['COL_A','COL_B']
indexes_of_categories = [train.columns.get_loc(col) for col in categoricals]

gkf = KFold(n_splits=5, shuffle=True, random_state=42).split(X=train, y=y)

param_grid = {
    'num_leaves': [31, 127],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1, 1.5],
    'lambda_l2': [0, 1]
    }

lgb_estimator = lgb.LGBMClassifier(
    boosting_type='gbdt', objective='binary', num_boost_round=2000,
    learning_rate=0.01, metric='auc',
    categorical_feature=indexes_of_categories)

gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=gkf)
lgb_model = gsearch.fit(X=train, y=y)

print(lgb_model.best_params_, lgb_model.best_score_)

This seems to work, but it raises a UserWarning:

categorical_feature keyword has been found in params and will be ignored. Please use categorical_feature argument of the Dataset constructor to pass this parameter.

I am looking for a working solution, or a suggestion on how to make lightgbm accept the categorical feature argument in the code above.

Harsh Gupta
bhaskarc
  • May I ask whether there is a reason why a scoring function is omitted from the GridSearchCV call? – Helen Oct 06 '18 at 11:28
  • You would be better off using lightgbm's native cross-validation API (lgb.cv) instead of GridSearchCV, since lgb.cv lets you use early_stopping_rounds. – Sift Feb 12 '19 at 04:58

2 Answers

As the warning states, categorical_feature is not one of the LGBMModel constructor arguments. It belongs to the lgb.Dataset constructor, and in the sklearn API the Dataset is built inside the fit() method (see the docs). So, to use it in a GridSearchCV search, pass it as a keyword argument to GridSearchCV.fit() (sklearn v0.19.1), or through the fit_params argument of the GridSearchCV constructor in older sklearn versions.

Mischa Lisovyi

In case you are struggling with how to pass the fit_params, as I was, this is how to do it:

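# model, param_grid, n_folds, x_train and y_train are assumed to be defined already
fit_params = {'categorical_feature': indexes_of_categories}
clf = GridSearchCV(model, param_grid, cv=n_folds)
# keyword arguments given to fit() are forwarded to the estimator's fit()
clf.fit(x_train, y_train, **fit_params)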
saeedghadiri