I'm stuck with a dataset that contains some categorical features with high cardinality, like 'item_description' ... I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature-engine', but I didn't really find anything there that solves my issue. Any suggestions, please?
Is item_description a long string? A meaningful English string? – Zabir Al Nazi May 04 '20 at 05:11
5 Answers
Options:
i) Use Target encoding.
More on target encoding : https://maxhalford.github.io/blog/target-encoding/
Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models, 3rd video]
ii) Use entity embeddings: in short, this technique represents each category as a dense vector that is learned during training, so the vector ends up capturing the characteristics of the category (a minimal sketch follows this answer).
Tutorial : https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088
Notebook implementations:
iii) Use CatBoost, which handles categorical features natively:
Extra: the hashing trick might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
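To illustrate option (ii), here is a minimal entity-embedding sketch in Keras (assuming TensorFlow is installed); the data, sizes, and layer name are illustrative placeholders, not taken from the linked tutorial:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n_categories = 1000   # distinct values of the categorical column
embedding_dim = 16    # hyperparameter: size of each learned vector

# toy data: integer-encoded category ids and a binary target
cat_ids = np.random.randint(0, n_categories, size=(5000, 1))
y = np.random.randint(0, 2, size=5000)

inp = layers.Input(shape=(1,), dtype="int32")
emb = layers.Embedding(n_categories, embedding_dim, name="cat_embedding")(inp)
out = layers.Dense(1, activation="sigmoid")(layers.Flatten()(emb))

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(cat_ids, y, epochs=3, batch_size=256, verbose=0)

# the learned embedding matrix: one 16-dimensional vector per category,
# reusable as features for any other model
vectors = model.get_layer("cat_embedding").get_weights()[0]   # shape (1000, 16)
```

The learned vectors can then be fed to a gradient-boosting model or any other estimator in place of the raw categories.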

You could look into the category_encoders package. It offers many different encoders that can encode columns with high cardinality into a single column. Among them are the so-called Bayesian encoders, which use information from the target variable to transform a given feature. For instance, the TargetEncoder uses Bayesian principles to replace a categorical feature with the expected value of the target given the values the category takes, which is very similar to LeaveOneOut. You may also check the CatBoost-based CatBoostEncoder, which is a common choice for feature encoding.
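A minimal sketch of how these encoders are typically used (the column name and toy data are placeholders, not from the question):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"item_description": ["red shirt", "blue jeans", "red shirt", "green hat"]})
y = pd.Series([1, 0, 1, 0])

# target (mean) encoding with smoothing toward the global mean
te = ce.TargetEncoder(cols=["item_description"], smoothing=0.3)
X_te = te.fit_transform(X, y)

# CatBoost-style ordered target encoding, which reduces target leakage
cbe = ce.CatBoostEncoder(cols=["item_description"])
X_cbe = cbe.fit_transform(X, y)
```

Fit on the training data only, then call `te.transform(X_test)` at inference time so no target information leaks from the test set.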

This Medium article I wrote might help as well: 4 ways to encode categorical features with high cardinality. It explores four encoding methods applied to a dataset with 26 categorical features with cardinalities up to 40k (includes code):
Target encoding
- PROS: parameter free; no increase in feature space
- CONS: risk of target leakage (target leakage means using information from the target to predict the target itself); when categories have few samples, the target encoder replaces them with values very close to the target itself, which makes the model prone to overfitting the training set; does not accept new values in the test set
Count encoding (see the sketch after this list)
- PROS: easy to understand and implement; parameter free; no increase in feature space
- CONS: risk of information loss when collisions happen (two categories with the same frequency become indistinguishable); can be too simplistic (the only information we keep from the categorical features is their frequency); does not accept new values in the test set
Feature hashing
- PROS: limited increase of feature space (as compared to one hot encoding); does not grow in size and accepts new values during inference as it does not maintain a dictionary of observed categories; captures interactions between features when feature hashing is applied on all categorical features combined to create a single hash
- CONS: need to tune the dimension of the hashing space; risk of collisions when the hashing space is not big enough
Embedding
- PROS: limited increase of feature space (as compared to one hot encoding); accepts new values during inference; captures interactions between features and learns the similarities between categories
- CONS: need to tune the embedding size; the embeddings cannot be trained jointly with a downstream model such as logistic regression or a decision forest, since those models do not train with backpropagation. Instead, the embeddings have to be trained in an initial phase and then used as static inputs to the downstream model.
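To make the count-encoding method above concrete, here is a minimal sketch in plain pandas (column names and data are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"item_description": ["red shirt", "blue jeans", "red shirt", "green hat"]})
test = pd.DataFrame({"item_description": ["red shirt", "yellow scarf"]})

# map each category to how often it appears in the training data
counts = train["item_description"].value_counts()
train["item_description_count"] = train["item_description"].map(counts)

# categories unseen during training get NaN, which we fill with 0
test["item_description_count"] = test["item_description"].map(counts).fillna(0)
```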

ah okay. please carefully read [/help/promotion](/help/promotion) and apply what you learn here :) – starball Jun 28 '23 at 17:59
For variables like "item_description" which are in essence text variables, check this paper and corresponding Python package.
Or simply search online for "dirty categorical variables". If in doubt, note that the article and package are from Gaël Varoquaux, one of the main developers of scikit-learn.
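The original links are not shown here, but assuming the paper and package referred to are the similarity-encoding work for "dirty" categories and the dirty_cat package (since folded into skrub), a minimal sketch might look like this:

```python
import pandas as pd
from dirty_cat import SimilarityEncoder  # pip install dirty_cat

X = pd.DataFrame({"item_description": ["red cotton shirt", "red shirt", "blue denim jeans"]})

# encodes each string by its n-gram similarity to the observed categories,
# so near-duplicate descriptions end up with similar vectors
enc = SimilarityEncoder()
X_enc = enc.fit_transform(X[["item_description"]])
```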

Hashing is a technique used to transform categorical data into numerical data. The main idea is to map each category to an integer by applying a hash function; the resulting integers can then be used (for example, as column indices in a fixed-size feature space) as input to machine learning algorithms.
One common hash function used for this purpose is MurmurHash, designed to provide high-quality hashing with good performance. Hashing has many other applications as well, including data retrieval, integrity checking, and cryptographic uses, with well-known families such as Message Digest (MD2, MD5) and the Secure Hash Algorithms (SHA-0, SHA-1, SHA-2).
Because hashing projects the data into fewer dimensions, it may lose information: different categories can end up mapped to the same integer, which results in collisions. This can be mitigated by using a larger hash space (i.e., more output dimensions) or a different hash function.
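For a concrete example, scikit-learn's FeatureHasher (which uses MurmurHash3 internally) can map a high-cardinality column into a fixed-size space; the column values and hash-space size below are illustrative choices:

```python
from sklearn.feature_extraction import FeatureHasher

descriptions = ["red shirt", "blue jeans", "red shirt", "green hat"]

# 2**10 output columns: more columns mean fewer collisions but a wider matrix
hasher = FeatureHasher(n_features=2**10, input_type="string")
X_hashed = hasher.transform([[d] for d in descriptions])  # sparse matrix, shape (4, 1024)
```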
Another approach to handling high-cardinality categorical variables is to use target encoding or mean encoding. This involves replacing each category with the average target value for that category in the training data. This can be effective, but it can also lead to overfitting, particularly if the number of categories is very large.
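A minimal hand-rolled version of that idea, with simple smoothing so that rare categories are pulled toward the global mean (column names and the smoothing strength are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "item_description": ["red shirt", "blue jeans", "red shirt", "green hat", "blue jeans"],
    "target": [1, 0, 1, 0, 1],
})

global_mean = train["target"].mean()
stats = train.groupby("item_description")["target"].agg(["mean", "count"])

# smoothing: categories with few samples are pulled toward the global mean
m = 10  # smoothing strength, a hyperparameter
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["item_description_te"] = train["item_description"].map(smoothed)
```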
Resources
