How does pandas Categorical (https://pandas.pydata.org/pandas-docs/stable/categorical.html) handle new, unseen levels? I am thinking of a scikit-learn-like setup. Currently, I have something like this: https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
def fit():
    for each column:
        fit a label encoder
def transform():
    for each column:
        check whether the column contains unseen levels
            yes (unseen): replace them
            no: label encode
However, this is pretty slow.
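To make the outline above concrete, here is a minimal, dependency-free sketch of the same fit/transform idea. It is not the code from the linked gist: a plain dict stands in for scikit-learn's LabelEncoder, and the names `ColumnLabelEncoder` and `UNSEEN`, as well as the choice of -1 as the replacement code, are assumptions made for illustration.

```python
# Hypothetical stand-in for the per-column label-encoder loop outlined above.
UNSEEN = -1  # assumed placeholder code for levels not present at fit time


class ColumnLabelEncoder:
    def fit(self, columns):
        # columns: dict mapping column name -> list of raw values.
        # Each column gets its own level -> integer code mapping.
        self.mappings_ = {
            name: {level: code for code, level in enumerate(sorted(set(values)))}
            for name, values in columns.items()
        }
        return self

    def transform(self, columns):
        # Known levels get their fitted code; unseen levels are replaced by UNSEEN.
        return {
            name: [self.mappings_[name].get(value, UNSEEN) for value in values]
            for name, values in columns.items()
        }


train = {"color": ["red", "green", "blue"], "size": ["S", "M", "S"]}
new = {"color": ["green", "purple"], "size": ["L", "M"]}

encoder = ColumnLabelEncoder().fit(train)
encoded = encoder.transform(new)
print(encoded)
```

Here "purple" and "L" were not seen during fit, so they end up with the placeholder code instead of raising an error.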
Apparently, tree-based libraries such as XGBoost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle with this slow manual conversion.
But looking at their code (https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532), they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.
I wonder how that can handle unseen data.
If a manual conversion is required, would pandas.Categorical allow a quicker conversion, even if the new data contains unseen levels?
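For context, a minimal sketch of what such a conversion could look like. It relies on the documented behaviour that values outside the given `categories` become NaN, which `.codes` reports as -1. The column name and data are made up for illustration, and whether this is actually faster than the label-encoder loop has not been benchmarked here.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["green", "purple", "red"]})

# "fit": remember the levels seen in the training data, per column
levels = {col: train[col].astype("category").cat.categories for col in train.columns}

# "transform": reuse the stored levels; values outside them become NaN,
# which shows up as code -1
encoded = pd.DataFrame(
    {col: pd.Categorical(new[col], categories=levels[col]).codes for col in new.columns},
    index=new.index,
)
print(encoded)
```

The unseen level "purple" is mapped to -1 without any per-value Python loop; how a downstream model should treat that code is a separate decision.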
Edit:
Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview of how I could not get scikit-learn's usual suspects to work. I am still looking for something more performant than my solution. Also, the LightGBM issue https://github.com/Microsoft/LightGBM/issues/789 suggests using a custom encoding strategy.