I'm stuck with a dataset that contains some categorical features with high cardinality, like 'item_description' ... I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature-engine', but I didn't really find anything there that solves my issue. Any suggestions, please?
Is item_description a long string? A meaningful English string? – Zabir Al Nazi May 04 '20 at 05:11
5 Answers
Options:
i) Use Target encoding.
More on target encoding : https://maxhalford.github.io/blog/target-encoding/
Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models, 3rd video]
ii) Use entity embeddings: in short, this technique represents each category as a dense vector that is learned during training, so the vector ends up capturing the characteristics of the category (a minimal sketch follows this answer).
Tutorial : https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088
Notebook implementations:
iii) Use CatBoost, which handles categorical features natively:
Extra: the hashing trick might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
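To illustrate option (ii), here is a minimal entity-embedding sketch in Keras (assuming TensorFlow is installed); the data, sizes, and layer name are illustrative placeholders, not taken from the linked tutorial:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n_categories = 1000   # distinct values of the categorical column
embedding_dim = 16    # hyperparameter: size of each learned vector

# toy data: integer-encoded category ids and a binary target
cat_ids = np.random.randint(0, n_categories, size=(5000, 1))
y = np.random.randint(0, 2, size=5000)

inp = layers.Input(shape=(1,), dtype="int32")
emb = layers.Embedding(n_categories, embedding_dim, name="cat_embedding")(inp)
out = layers.Dense(1, activation="sigmoid")(layers.Flatten()(emb))

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(cat_ids, y, epochs=3, batch_size=256, verbose=0)

# the learned embedding matrix: one 16-dimensional vector per category,
# reusable as features for any other model
vectors = model.get_layer("cat_embedding").get_weights()[0]   # shape (1000, 16)
```

The learned vectors can then be fed to a gradient-boosting model or any other estimator in place of the raw categories.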

You could look into the category_encoders package. It offers many different encoders that can encode columns with high cardinality into a single column. Among them are the so-called Bayesian encoders, which use information from the target variable to transform a given feature. For instance, the TargetEncoder uses Bayesian principles to replace a categorical feature with the expected value of the target given the values the category takes, which is very similar to LeaveOneOut. You may also check the CatBoost-based CatBoostEncoder, which is a common choice for feature encoding.
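A minimal sketch of how these encoders are typically used (the column name and toy data are placeholders, not from the question):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"item_description": ["red shirt", "blue jeans", "red shirt", "green hat"]})
y = pd.Series([1, 0, 1, 0])

# target (mean) encoding with smoothing toward the global mean
te = ce.TargetEncoder(cols=["item_description"], smoothing=0.3)
X_te = te.fit_transform(X, y)

# CatBoost-style ordered target encoding, which reduces target leakage
cbe = ce.CatBoostEncoder(cols=["item_description"])
X_cbe = cbe.fit_transform(X, y)
```

Fit on the training data only, then call `te.transform(X_test)` at inference time so no target information leaks from the test set.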

This Medium article I wrote might help as well: 4 ways to encode categorical features with high cardinality. It explores four encoding methods applied to a dataset with 26 categorical features with cardinalities up to 40k (includes code):
Target encoding
- PROS: parameter free; no increase in feature space
- CONS: risk of target leakage (target leakage means using information from the target to predict the target itself); when categories have few samples, the target encoder replaces them with values very close to the target itself, which makes the model prone to overfitting the training set; does not accept new values in the test set
Count encoding (see the sketch after this list)
- PROS: easy to understand and implement; parameter free; no increase in feature space
- CONS: risk of information loss when collisions happen (two categories with the same frequency become indistinguishable); can be too simplistic (the only information we keep from the categorical features is their frequency); does not accept new values in the test set
Feature hashing
- PROS: limited increase of feature space (as compared to one hot encoding); does not grow in size and accepts new values during inference as it does not maintain a dictionary of observed categories; captures interactions between features when feature hashing is applied on all categorical features combined to create a single hash
- CONS: need to tune the dimension of the hashing space; risk of collisions when the hashing space is not big enough
Embedding
- PROS: limited increase of feature space (as compared to one hot encoding); accepts new values during inference; captures interactions between features and learns the similarities between categories
- CONS: need to tune the embedding size; the embeddings cannot be trained jointly with a downstream model such as logistic regression or a decision forest, since those models do not train with backpropagation. Instead, the embeddings have to be trained in an initial phase and then used as static inputs to the downstream model.
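To make the count-encoding method above concrete, here is a minimal sketch in plain pandas (column names and data are illustrative):

```python
import pandas as pd

train = pd.DataFrame({"item_description": ["red shirt", "blue jeans", "red shirt", "green hat"]})
test = pd.DataFrame({"item_description": ["red shirt", "yellow scarf"]})

# map each category to how often it appears in the training data
counts = train["item_description"].value_counts()
train["item_description_count"] = train["item_description"].map(counts)

# categories unseen during training get NaN, which we fill with 0
test["item_description_count"] = test["item_description"].map(counts).fillna(0)
```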

ah okay. please carefully read [/help/promotion](/help/promotion) and apply what you learn here :) – starball Jun 28 '23 at 17:59
For variables like "item_description" which are in essence text variables, check this paper and corresponding Python package.
Or simply search online for "dirty categorical variables". If in doubt, note that the article and package are from Gaël Varoquaux, one of the main developers of scikit-learn.
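The original links are not shown here, but assuming the paper and package referred to are the similarity-encoding work for "dirty" categories and the dirty_cat package (since folded into skrub), a minimal sketch might look like this:

```python
import pandas as pd
from dirty_cat import SimilarityEncoder  # pip install dirty_cat

X = pd.DataFrame({"item_description": ["red cotton shirt", "red shirt", "blue denim jeans"]})

# encodes each string by its n-gram similarity to the observed categories,
# so near-duplicate descriptions end up with similar vectors
enc = SimilarityEncoder()
X_enc = enc.fit_transform(X[["item_description"]])
```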

Hashing is a technique used to transform categorical data into numerical data. The main idea is to map each category to an integer by applying a hash function; the resulting integers can then be used (for example, as column indices in a fixed-size feature space) as input to machine learning algorithms.
One common hash function used for this purpose is MurmurHash, designed to provide high-quality hashing with good performance. Hashing has many other applications as well, including data retrieval, integrity checking, and cryptographic uses, with well-known families such as Message Digest (MD2, MD5) and the Secure Hash Algorithms (SHA-0, SHA-1, SHA-2).
Because hashing projects the data into fewer dimensions, it may lose information: different categories can end up mapped to the same integer, which results in collisions. This can be mitigated by using a larger hash space (i.e., more output dimensions) or a different hash function.
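For a concrete example, scikit-learn's FeatureHasher (which uses MurmurHash3 internally) can map a high-cardinality column into a fixed-size space; the column values and hash-space size below are illustrative choices:

```python
from sklearn.feature_extraction import FeatureHasher

descriptions = ["red shirt", "blue jeans", "red shirt", "green hat"]

# 2**10 output columns: more columns mean fewer collisions but a wider matrix
hasher = FeatureHasher(n_features=2**10, input_type="string")
X_hashed = hasher.transform([[d] for d in descriptions])  # sparse matrix, shape (4, 1024)
```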
Another approach to handling high-cardinality categorical variables is to use target encoding or mean encoding. This involves replacing each category with the average target value for that category in the training data. This can be effective, but it can also lead to overfitting, particularly if the number of categories is very large.
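A minimal hand-rolled version of that idea, with simple smoothing so that rare categories are pulled toward the global mean (column names and the smoothing strength are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "item_description": ["red shirt", "blue jeans", "red shirt", "green hat", "blue jeans"],
    "target": [1, 0, 1, 0, 1],
})

global_mean = train["target"].mean()
stats = train.groupby("item_description")["target"].agg(["mean", "count"])

# smoothing: categories with few samples are pulled toward the global mean
m = 10  # smoothing strength, a hyperparameter
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["item_description_te"] = train["item_description"].map(smoothed)
```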
Resources
