
How does pandas Categorical (https://pandas.pydata.org/pandas-docs/stable/categorical.html) handle new, unseen levels? I am thinking about a scikit-learn-like setup. Currently, I have something like this: https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce

def fit():
    for each column:
        fit a label encoder

def transform():
    for each column:
        for each value:
            if the value was unseen during fit: replace it with a placeholder
            else: label encode it

but this is pretty slow.

Apparently, decision-tree libraries like xgboost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle around manually with this slow conversion. But when looking at their code (https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532), they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.

I wonder how that can handle unseen data.

If a manual conversion is required, would pandas.Categorical allow a quicker conversion, even if unseen levels are in the new data?
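For reference, here is a minimal sketch of what pandas does when new data is encoded against a fixed set of categories. The data and variable names are made up for illustration. Values outside the stored categories become NaN, and their integer code is -1.

```python
import pandas as pd

# Illustrative data: "d" never appears in the training column.
train = pd.Series(["a", "b", "c", "a"])
test = pd.Series(["a", "b", "d"])

# "Fit": remember the levels seen in the training data.
levels = pd.Categorical(train).categories

# "Transform": encode new data against the stored levels only.
encoded = pd.Categorical(test, categories=levels)

# Integer codes follow the order of `levels`; unseen values get -1.
codes = encoded.codes
print(list(codes))
```

Whether -1 (or NaN) is an acceptable placeholder for unseen levels depends on the downstream model, so this only shows the encoding step and does not recommend a strategy.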

edit

Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview of how I could not get scikit-learn's usual suspects to work. I am still looking for something more performant than my solution. Also, LightGBM (https://github.com/Microsoft/LightGBM/issues/789) suggests using a custom encoding strategy.

Georg Heiler
  • Pandas.Categorical just fills them with NaN. Scikit-learn doesn't handle new data either; they will most likely be removed or replaced with NaN. Try using LabelEncoder with unseen data and you will get `ValueError: y contains new labels:`. – Vivek Kumar Aug 17 '17 at 06:47
  • Why did this question get downvoted? I guess a small reproducible data set and a desired one could help to understand the problem better... – MaxU - stand with Ukraine Aug 17 '17 at 08:59
  • @MaxU Maybe because, in my opinion, it's more of an algorithmic approach problem than a programming problem. How to handle unseen data is a frequent issue in machine learning, and [Cross-validated](http://stats.stackexchange.com) is the right place for that. – Vivek Kumar Aug 17 '17 at 10:23
  • @VivekKumar, i'd say it's in the grey zone (in between) ;-) – MaxU - stand with Ukraine Aug 17 '17 at 10:26
  • @VivekKumar if you want to move the question - that is fine. – Georg Heiler Aug 17 '17 at 11:03
  • I don't have enough privileges to do so on my own. It can only be moved if others feel the same. That's just my opinion, but I think this might get better attention there. – Vivek Kumar Aug 17 '17 at 11:06
  • @VivekKumar, what about LabelBinarizer? Sure, there is currently a [bug](https://github.com/scikit-learn/scikit-learn/issues/6723#issuecomment-323036777), but it works great on unseen data, although it loses some information on the relations between the categories – Quickbeam2k1 Aug 17 '17 at 12:08
  • @Quickbeam2k1 It maps all columns to 0 in a multilabel scenario and maps to the first class in a binary scenario, so it's up to the OP if he wants something like this. Which, in my opinion, brings us to the question of what the most used approach is in most real situations when faced with this. – Vivek Kumar Aug 17 '17 at 12:17
  • What do you mean by class 0? I.e. the first observation for this feature? – Georg Heiler Aug 17 '17 at 12:19
  • I just recalled that I once asked a similar [question](https://stackoverflow.com/questions/39804733/dummy-creation-in-pipeline-with-different-levels-in-train-and-test-set). The answer might be worth a look if you know all potential categories beforehand – Quickbeam2k1 Aug 17 '17 at 12:42
  • The real solution is in https://github.com/scikit-learn/scikit-learn/pull/9151 and https://github.com/scikit-learn/scikit-learn/pull/9012 which are unfortunately not merged yet. LabelEncoder + OneHotEncoder or LabelBinarizer or CountVectorizer(tokenizer=lambda x: x) are all possible workarounds (though none is great). – Andreas Mueller Aug 17 '17 at 15:09

1 Answer


There might be a pandas solution, but it probably works best with sklearn's LabelBinarizer:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
lb = LabelBinarizer()
lb.fit(df['A'])  # learns the classes 'a', 'b', 'c'
print(lb.transform(df['A']))

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]

df2 = pd.DataFrame({'A': ['a', 'b', 'd']})  # 'd' was not seen during fit
print(lb.transform(df2['A']))

[[1 0 0]
 [0 1 0]
 [0 0 0]]

So we see that 'd' is mapped to none of 'a', 'b', or 'c': its row is all zeros. Note, however, that there is a bug which will probably be resolved in one of the next sklearn releases.

The LabelBinarizer is fit during training and remembers the values passed to it. New values get mapped to all zeros. It might be more feasible to write a transformer (as seen here before the edit) using pandas get_dummies.

This could be quite straightforward thanks to name matching of columns. In fit, store the column names produced by get_dummies; then, in transform, keep only the column names identified during fitting (potentially adding some zero columns if training levels are not present in the test set). Then you are done ;)
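A minimal sketch of that idea, assuming a plain class that follows the fit/transform convention. The class name `DummyEncoder` and the example frames are hypothetical; only `pandas.get_dummies` and `DataFrame.reindex` are real pandas calls.

```python
import pandas as pd


class DummyEncoder:
    """Hypothetical helper: one-hot encode with pandas.get_dummies and keep
    only the dummy columns that were seen during fit."""

    def fit(self, X, y=None):
        # Remember the dummy column names produced by the training data.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, dtype=int)
        # reindex drops columns for unseen levels and adds all-zero columns
        # for training levels that are missing from X.
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Illustrative data: 'd' is unseen, and 'c' is missing from the test set.
train = pd.DataFrame({"A": ["a", "b", "c", "a"]})
test = pd.DataFrame({"A": ["a", "b", "d"]})

encoder = DummyEncoder().fit(train)
encoded = encoder.transform(test)
print(encoded)
```

As with the LabelBinarizer example, the unseen level 'd' ends up as an all-zero row. Because `get_dummies` encodes every object-typed column of a DataFrame, the same sketch should also cover several categorical columns at once.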

Quickbeam2k1
  • But that is lacking the vocabulary - you do not guarantee that a second df with potentially unseen categorical levels, or levels in a different order, will be encoded properly. – Georg Heiler Aug 17 '17 at 12:01
  • You are right, the ordering might be missing. However, when working on columns with discrete values, new categories/values will be mapped to zero in every column. – Quickbeam2k1 Aug 17 '17 at 12:04
  • Regarding ordering, if df_1 contains a,b and this is mapped to 1,2 and df_2 contains b,c this should be mapped to 2,0 so it should not be a problem? Or do I misunderstand this? – Georg Heiler Aug 17 '17 at 12:08
  • I'll add an example in the answer, give me a moment, and there seems to be a mistake in the fit function ;) – Quickbeam2k1 Aug 17 '17 at 12:09
  • Updated it and removed the previous class in favor of the LabelBinarizer from sklearn. However, if you are interested, I could also construct a related solution with pandas `get_dummies` (or try it on your own) – Quickbeam2k1 Aug 17 '17 at 12:38
  • Can the binarizer handle multiple columns at the same time? – Georg Heiler Sep 10 '17 at 08:32