18

From my reading of the LightGBM documentation, one is supposed to define categorical features in the Dataset constructor. So I have the following code:

cats=['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)

However, I received the following warning message:

/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')

Why did I get the warning message?

David293836

3 Answers

29

I presume that you get this warning in a call to lgb.train. This function also has a categorical_feature argument, and its default value is 'auto', which means taking the categorical columns from the pandas DataFrame (documentation). The warning, which is emitted at this line, indicates that, even though lgb.train has requested that categorical features be identified automatically, LightGBM will use the features specified in the dataset instead.

To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.
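
As a minimal sketch of both options (the toy data and params dict below are placeholders, and this assumes a LightGBM version in which lgb.train still accepts categorical_feature):

import lightgbm as lgb
import pandas as pd

# Toy data; the real X, y and params come from your own pipeline
X = pd.DataFrame({
    'C1': pd.Categorical(['a', 'b', 'a', 'b'] * 25),
    'C2': pd.Categorical(['x', 'x', 'y', 'y'] * 25),
    'num': range(100),
})
y = [v % 2 for v in range(100)]
params = {'objective': 'binary', 'verbose': -1}
cats = ['C1', 'C2']

# Option 1: pass the same list to both Dataset and train
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
booster = lgb.train(params, d_train, categorical_feature=cats)

# Option 2: construct the dataset with categorical_feature=None and
# only specify the categorical features in train
d_train = lgb.Dataset(X, label=y, categorical_feature=None)
booster = lgb.train(params, d_train, categorical_feature=cats)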

Andrey Popov
  • "construct the dataset with categorical_feature=None": not really possible if using a DataFrame through the sklearn API. – memeplex Dec 10 '20 at 21:27
  • Does this still work? No matter how/where I specify the categorical variables I keep getting `Overriding the parameters from Reference Dataset. categorical_column in param dict is overridden.` – ironv May 01 '22 at 17:18
  • Unfortunately the warning does still occur – CutePoison May 25 '22 at 09:43
4

As user andrey-popov described, you can use lgb.train's categorical_feature parameter to get rid of this warning.

Below is a simple example of how you could do it:

# Define categorical features
cat_feats = ['item_id', 'dept_id', 'store_id', 
             'cat_id', 'state_id', 'event_name_1',
             'event_type_1', 'event_name_2', 'event_type_2']
    ...

# Define the datasets with the categorical_feature parameter
train_data = lgb.Dataset(X.loc[train_idx], 
                         Y.loc[train_idx], 
                         categorical_feature=cat_feats, 
                         free_raw_data=False)

valid_data = lgb.Dataset(X.loc[valid_idx], 
                         Y.loc[valid_idx], 
                         categorical_feature=cat_feats, 
                         free_raw_data=False)

# And train using the categorical_feature parameter
lgb.train(lgb_params, 
          train_data, 
          valid_sets=[valid_data], 
          verbose_eval=20, 
          categorical_feature=cat_feats, 
          num_boost_round=1200)
codeananda
gil.fernandes
  • Sorry for the late response. I still got the following warning messages: /home/dlin/.conda/envs/mybase/lib/python3.7/site-packages/lightgbm/basic.py:1279: UserWarning: Overriding the parameters from Reference Dataset. warnings.warn('Overriding the parameters from Reference Dataset.') /home/dlin/.conda/envs/mybase/lib/python3.7/site-packages/lightgbm/basic.py:1091: UserWarning: categorical_column in param dict is overridden. warnings.warn('{} in param dict is overridden.'.format(cat_alias)) – David293836 Aug 04 '20 at 03:20
  • Same here - I feel there are a lot of bugs right now regarding verbosity in LightGBM – CutePoison May 25 '22 at 09:43
0

This is less an answer to the original question and more an answer for people who are using the sklearn API and encounter this issue, especially with one of sklearn's cross_val methods. There are two solutions you could consider.

Sklearn API solution: a solution that worked for me was to cast the categorical fields to the category datatype in pandas.

If you are using a pandas DataFrame, LightGBM should automatically treat those columns as categorical. From the documentation:

integer codes will be extracted from pandas categoricals in the Python-package

It would make sense for this to be the sklearn API's equivalent of setting categoricals in the Dataset object. But keep in mind that the sklearn API does not officially support passing extra non-core parameters via **kwargs, and the documentation says so explicitly:

**kwargs is not supported in sklearn, it may cause unexpected issues.
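
As a minimal sketch of this approach (the column names and toy data below are invented for illustration):

import pandas as pd
from lightgbm import LGBMClassifier

# Toy data for illustration only
X = pd.DataFrame({
    'C1': ['a', 'b', 'a', 'c'] * 25,
    'C2': ['x', 'x', 'y', 'y'] * 25,
    'num': range(100),
})
y = [v % 2 for v in range(100)]

# Cast the categorical columns to the pandas 'category' dtype so
# LightGBM picks them up automatically, without passing any
# categorical_feature argument through the sklearn API
for col in ['C1', 'C2']:
    X[col] = X[col].astype('category')

clf = LGBMClassifier()
clf.fit(X, y)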

Adaptive solution: the other, more sure-fire way to be able to use methods like cross_val_predict is to create your own wrapper class that uses the core Dataset/train API under the hood but exposes a fit/predict interface for the cv methods to latch onto. That way you get the full functionality of LightGBM with only a little bit of rolling your own code.

The below sketches out what this could look like.

import lightgbm as ltb
from sklearn.base import BaseEstimator


# Subclassing BaseEstimator provides get_params/set_params, which
# sklearn's cross-validation utilities need in order to clone the estimator.
class LGBMSKLWrapper(BaseEstimator):

    def __init__(self, categorical_variables, params):
        self.categorical_variables = categorical_variables
        self.params = params
        self.model = None

    def fit(self, X, y):
        # Build the core Dataset with the categorical features declared up front
        my_dataset = ltb.Dataset(X, y, categorical_feature=self.categorical_variables)
        self.model = ltb.train(params=self.params, train_set=my_dataset)
        return self

    def predict(self, X):
        return self.model.predict(X)

The above lets you load up your parameters when you create the object and then passes them on to lgb.train when the client calls fit.
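
For example, a hypothetical usage sketch with sklearn's cross_val_predict could look like this (the column names, toy data, and params dict are placeholders):

import pandas as pd
from sklearn.model_selection import cross_val_predict

# Placeholder data for illustration only
X = pd.DataFrame({
    'store_id': pd.Categorical(['s1', 's2', 's1', 's2'] * 25),
    'sales': range(100),
})
y = [v % 2 for v in range(100)]

wrapper = LGBMSKLWrapper(
    categorical_variables=['store_id'],
    params={'objective': 'binary', 'verbose': -1},
)

# Out-of-fold predictions via the fit/predict interface
preds = cross_val_predict(wrapper, X, y, cv=5)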

David R