
I am working on a dataset that has a feature with multiple categories per example. The feature looks like this:

                              Feature
0   [Category1, Category2, Category2, Category4, Category5]
1                     [Category11, Category20, Category133]
2                                    [Category2, Category9]
3                [Category1000, Category1200, Category2000]
4                                              [Category12]

The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn

Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to the similar question above. However, there are around 2000 categories, which results in sparse, very high-dimensional encoded data.
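
For reference, this is roughly what the MultiLabelBinarizer approach looks like (the column name and the small example lists are just placeholders for my actual data):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"Feature": [
    ["Category1", "Category2", "Category4", "Category5"],
    ["Category11", "Category20", "Category133"],
    ["Category2", "Category9"],
]})

mlb = MultiLabelBinarizer()
# one 0/1 column per distinct category -> ~2000 columns on the real data
encoded = pd.DataFrame(mlb.fit_transform(df["Feature"]), columns=mlb.classes_)
print(encoded.shape)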

Is there any other encoding that can be used, or any other possible solution to this problem? Thanks.

3 Answers


In many cases where a column with many categories generated too many features, I opted for binary encoding. It worked out fine most of the time, so it is probably worth a shot for you.

Imagine you have 9 categories, number them from 1 to 9, and binary encode them; you get:

cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0 
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1

This is the basic intuition behind Binary Encoder.


PS: Since 2 to the power of 11 is 2048 and you have around 2000 categories, you can represent them with just 11 feature columns instead of many more (for example, 1999 columns in the case of one-hot encoding)!
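
A minimal sketch of that idea in plain Python/NumPy (the 9 categories and the 4-bit width match the example above; how to combine the codes when one example has several categories is not addressed here):

import numpy as np

categories = [f"Category{i}" for i in range(1, 10)]        # the 9 example categories
index = {cat: i + 1 for i, cat in enumerate(categories)}   # number them 1..9
width = int(np.ceil(np.log2(len(categories) + 1)))         # 4 bits are enough for 1..9

def binary_encode(cat):
    # return the category's number as a fixed-width 0/1 vector
    i = index[cat]
    return [(i >> bit) & 1 for bit in reversed(range(width))]

print(binary_encode("Category5"))   # [0, 1, 0, 1], matching the table above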
Ankur Sinha

Given an incredibly sparse array, one could use a dimensionality-reduction technique such as PCA (principal component analysis) to reduce the feature space to the top k components that best describe the variance.

Assuming the 2000 MultiLabelBinarizer features are stored in X:

from sklearn.decomposition import PCA

k = 5
model = PCA(n_components=k, random_state=666)
model.fit(X)                      # learn the principal components
components = model.transform(X)   # project X onto the top k components

Then you can use the top k components as a lower-dimensional feature space that explains a large portion of the variance of the original feature space.

If you want to understand how well the new, smaller feature space describes the variance, you can use the following command:

model.explained_variance_
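
For example, explained_variance_ratio_ (also an attribute of the fitted PCA model) is often more convenient here, since it is already normalised to fractions of the total variance:

import numpy as np

print(model.explained_variance_)                    # variance captured by each component
print(model.explained_variance_ratio_)              # the same, as a fraction of the total
print(np.cumsum(model.explained_variance_ratio_))   # cumulative fraction kept by the top k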
Jlanday

I also encountered this same problem, but I solved it using CountVectorizer from sklearn.feature_extraction.text, simply by passing binary=True, i.e. CountVectorizer(binary=True).
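
A small sketch of how that could look on the list-valued column from the question (the analyzer=lambda x: x part is my assumption for feeding already-tokenized lists to CountVectorizer; the answer itself does not spell that step out):

from sklearn.feature_extraction.text import CountVectorizer

feature = [
    ["Category1", "Category2", "Category4"],
    ["Category11", "Category20", "Category133"],
    ["Category2", "Category9"],
]

# binary=True -> 1 if the category is present, regardless of how many times it appears
vectorizer = CountVectorizer(binary=True, analyzer=lambda x: x)
encoded = vectorizer.fit_transform(feature)   # scipy sparse matrix, one column per category
print(vectorizer.get_feature_names_out())
print(encoded.toarray())

Note that the result is still one column per category, but it is stored as a scipy sparse matrix, which keeps the memory footprint manageable.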

RobC