3

How can I persistently encode the same string to the same label for a given column? "Label encoding across multiple columns in scikit-learn" proposes a nice way to handle a data frame with multiple categorical columns. However, I am unsure whether this persists correctly (in a pickle) and would apply the same labels again to freshly incoming data.

So far I have used pandas directly and obtained the labels via .cat.codes of the category values. But now I need to integrate label encoding into a pipeline to deal with freshly incoming data.

Would something like

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode every string (object) column in place
for col in df.select_dtypes(include=['object']).columns:
    df[col] = le.fit_transform(df[col])

or the proposed MultiColumnLabelEncoder solution suffice for my task?

Georg Heiler

3 Answers

1

I came across the same problem and found a workaround: if we save the fitted encoder instance, we can reuse it to produce the expected outputs. The link below has a detailed answer: Using Scikit's LabelEncoder correctly across multiple programs
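As a minimal sketch of that idea (not the code from the linked answer), the fitted LabelEncoder can be pickled in the training program and restored in the inference program. The example values are made up, and an in-memory pickle stands in for a file on disk:

```python
import pickle

from sklearn.preprocessing import LabelEncoder

# Training program: fit once, then persist the fitted encoder.
le = LabelEncoder()
train_codes = le.fit_transform(["paris", "tokyo", "amsterdam", "tokyo"])
blob = pickle.dumps(le)  # with a file, pickle.dump(le, fh) would be used instead

# Inference program: restore the encoder and call transform() only,
# never fit() or fit_transform(), so the learned mapping is reused.
restored = pickle.loads(blob)
new_codes = restored.transform(["tokyo", "paris"])
```

Note that transform() raises a ValueError for a label that was not present when the encoder was fitted, so unseen values in fresh data need separate handling.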

mayank
  • https://stackoverflow.com/questions/28656736/using-scikits-labelencoder-correctly-across-multiple-programs?noredirect=1&lq=1 – MT467 May 07 '20 at 01:58
0

For a more generic approach, here is a pair of custom functions that handle fit and transform separately:

  • The fit function takes the training DataFrame and a list of categorical columns, and returns a dict of label encoders, one per column.
  • The dict is pickled and loaded again at inference time.
  • The transform function takes the inference DataFrame, the list of categorical columns, and the path to the pickled encoder dict, and returns the label-encoded DataFrame.

For the function code and a working example, please refer here:

Source: Link
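The fit, pickle, and transform steps described in this answer could look roughly like the sketch below. This is not the linked code: the function names fit_label_encoders and transform_with_encoders, the column names, and the temporary pickle path are all hypothetical and chosen only for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def fit_label_encoders(train_df, cat_cols):
    """Fit one LabelEncoder per categorical column and return them in a dict."""
    return {col: LabelEncoder().fit(train_df[col]) for col in cat_cols}


def transform_with_encoders(df, cat_cols, encoder_path):
    """Load the pickled encoder dict and return a label-encoded copy of df."""
    with open(encoder_path, "rb") as fh:
        encoders = pickle.load(fh)
    out = df.copy()
    for col in cat_cols:
        out[col] = encoders[col].transform(out[col])
    return out


# Training time: fit the encoders and pickle the dict (hypothetical data and path).
train = pd.DataFrame({"city": ["paris", "tokyo", "paris"], "size": ["s", "l", "m"]})
cat_cols = ["city", "size"]
encoder_path = Path(tempfile.mkdtemp()) / "encoders.pkl"
with open(encoder_path, "wb") as fh:
    pickle.dump(fit_label_encoders(train, cat_cols), fh)

# Inference time: only the small pickled dict is needed, not the training frame.
fresh = pd.DataFrame({"city": ["tokyo"], "size": ["m"]})
encoded = transform_with_encoders(fresh, cat_cols, encoder_path)
```

Keeping one encoder per column avoids the problem of a single shared LabelEncoder being refitted, and thus overwritten, on every column.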

Timoth Dev A
-1

This seems to be already handled for the single-column case: Using same Label Encoder to test dataset? or new Label Encoder?

So I used the aforementioned multi-column solution, which worked fine.

Georg Heiler
  • This answer means that you need to have the entire dataframe in memory at inference time. Far from ideal. – marbel Dec 14 '16 at 05:49
  • @marbel I understand. What solution would you propose? – Georg Heiler Dec 14 '16 at 06:05
  • Just to leave it here as a reference, I've answered the question [here](http://stackoverflow.com/questions/40321232/handling-unknown-values-for-label-encoding) – marbel Dec 14 '16 at 06:08