Reducing the Sparsity of a One-Hot Encoded dataset

Question

I'm trying to run some feature selection algorithms on the UCI Adult data set and I'm running into a problem with univariate feature selection. I'm one-hot encoding all the categorical data to make it numerical, but that gives me a separate F score for every encoded column.

How can I avoid this? What should I do to make this code better?

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Encode
adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)

adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)

# One-hot encode
adult = pd.get_dummies(adult, columns=["race"])

target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)

selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)

# Print the F score next to the name of each encoded column
for n, s in zip(data.columns, selector.scores_):
    print("F Score", s, "for feature", n)

EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White

Expected Results:
F Score "f_score" for feature "race"

One-hot encoding splits the feature above into many sub-features, whereas I would like a single score for race as a whole (see Expected Results), if that is possible.

Please: shorten code to the minimum necessary, and include sample data and desired results. General guidelines on asking questions: http://stackoverflow.com/help/mcve Pandas specific: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — JohnE, Dec 17 '16 at 15:55
@Username Can I suggest you edit the title of this question to be more descriptive of the actual issue you have? Maybe something along the lines of "Reducing the number of categorically encoded features for feature selection?" or "Reducing the Sparsity of a One-Hot Encoded dataset". — Little Bobby Tables, Dec 17 '16 at 16:04
@JohnE and josh, Thank you for your comments! I've made some changes to the question — Username, Dec 17 '16 at 16:28

score 3 · Accepted Answer · edited May 23 '17 at 12:24

One way to reduce the number of features, whilst still encoding your categories in a non-ordinal manner, is to use binary encoding. One-hot encoding grows linearly: a categorical feature with n categories needs n columns. Binary encoding needs only about log_2(n) columns. In other words, doubling the number of categories adds a single column for binary encoding, whereas it doubles the number of columns for one-hot encoding.
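
To make the growth rates concrete, here is a minimal sketch that counts the columns each scheme needs. The helper names one_hot_columns and binary_columns are hypothetical, and the sketch assumes categories are numbered from 0; an implementation that reserves a code (for example for unseen values) may use one more column.

```python
def one_hot_columns(n_categories):
    # One-hot encoding uses one indicator column per category.
    return n_categories


def binary_columns(n_categories):
    # Assumption: categories are numbered 0..n-1, so the column count is
    # the number of bits needed to write n-1 (at least one column).
    return max(1, (n_categories - 1).bit_length())


for n in (4, 8, 16, 32):
    print(n, "categories:", one_hot_columns(n), "one-hot columns,",
          binary_columns(n), "binary columns")
```

Under these assumptions, each doubling of n adds one binary column while the one-hot count doubles.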

Binary encoding can be easily implemented in Python using the category_encoders package. The package is pip installable and works seamlessly with sklearn and pandas. Here is an example:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})

enc_bin = ce.BinaryEncoder(cols=['cat1'])  # cols=None, all string columns encoded

df_trans = enc_bin.fit_transform(df)
print(df_trans)


Out[1]:
       cat1_0  cat1_1 cat2
    0       1       1    C
    1       0       1    S
    2       1       0    T
    3       0       0    B

Here is the code from a previous answer of mine, using the same variables as above but with one-hot encoding. Let's compare how the two outputs look.

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1']) # cols=None, all string columns encoded

df_trans = enc_ohe.fit_transform(df)
print(df_trans)


Out[2]:
       cat1_0  cat1_1  cat1_2  cat1_3 cat2
    0       0       0       1       0    C
    1       0       0       0       1    S
    2       1       0       0       0    T
    3       0       1       0       0    B

See how, in this example, binary encoding uses half as many columns (two instead of four) to uniquely describe each category within the feature cat1.
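
For intuition, the idea can be sketched in plain Python without any library. binary_encode is a hypothetical helper, not the category_encoders implementation: it assumes categories are numbered from 0 in sorted order, so its bit patterns will not necessarily match the library output above.

```python
def binary_encode(values):
    """Encode each category as a fixed-width list of binary digits."""
    # Assumption: codes are assigned from 0 in sorted order of the categories.
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    # Number of bits needed to tell all categories apart (at least one).
    width = max(1, (len(codes) - 1).bit_length())
    # Most significant bit first, one row of digits per input value.
    rows = [[(codes[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]
    return width, rows


width, rows = binary_encode(['A', 'N', 'K', 'P'])
for value, row in zip(['A', 'N', 'K', 'P'], rows):
    print(value, row)
```

Each of the four categories gets a distinct two-digit row, which is why two columns are enough here.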

Thanks for the well explained answer! I think this will do the trick :). — Username, Dec 17 '16 at 16:45
