
I am using TensorFlow, Python, and Pandas to create a logistic regression model similar to this link

Instead of the MNIST dataset, I am using my own dataset. I use Pandas to create the dataframes, replace nulls with the fillna function, and then convert them into a tensor dataset using from_tensor_slices.
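For reference, a minimal sketch of that pipeline (the file name, column names, and batch size here are hypothetical; the dict-of-columns pattern follows the standard tf.data usage for DataFrames):

```python
import pandas as pd
import tensorflow as tf

# Hypothetical file and column names, for illustration only
df = pd.read_csv("my_data.csv")

# Replace nulls: a numeric default for numeric columns,
# a sentinel string for categorical ones
df["income"] = df["income"].fillna(0)
df["zipcode"] = df["zipcode"].fillna("unknown")

# Separate the label, then build a tf.data.Dataset from the frame
labels = df.pop("label")
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels.values))
dataset = dataset.shuffle(buffer_size=len(df)).batch(32)
```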

I have many CATEGORICAL_COLUMNS and I am using get_dummies to do the one-hot encoding (along with label encoding). But the issue is, my categorical columns have large vocabularies (e.g., zip code: I have thousands of zip codes in my data). So when I create columns using get_dummies, the zip code column generates lots of new columns, as sketched below.
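To make the blow-up concrete, here is a small sketch with toy data; with real data the number of zipcode_* columns equals the number of distinct zip codes:

```python
import pandas as pd

df = pd.DataFrame({"zipcode": ["10001", "94103", "60601"],
                   "age": [34, 28, 45]})

# One-hot encode the categorical column; every distinct zip code
# becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["zipcode"])
print(encoded.columns.tolist())
# ['age', 'zipcode_10001', 'zipcode_60601', 'zipcode_94103']

# With thousands of distinct zip codes, this produces thousands of
# mostly-zero columns, one per unique value
```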

Is this a good approach? Is this the way I should approach this kind of dataset?

Relativity

1 Answer


When you have that many distinct categories, creating dummy columns is not going to help. I would suggest grouping the zip codes into counties or some other manageable set of categories. If that does not work, I would just label encode them (`from sklearn.preprocessing import LabelEncoder`) and include the encoded column as a feature, to see if it comes up as an important feature before thinking about more feature engineering. Both options are sketched below.
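A rough sketch of both suggestions; the zip values and the three-digit-prefix grouping are purely illustrative (any domain-appropriate grouping, e.g. a zip-to-county lookup, would do):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"zipcode": ["10001", "10002", "94103", "60601"]})

# Option 1: group zip codes into coarser regions, e.g. by their
# three-digit prefix (a rough geographic area), then one-hot those
df["zip_region"] = df["zipcode"].str[:3]

# Option 2: label-encode the raw zip code into a single integer column
encoder = LabelEncoder()
df["zipcode_encoded"] = encoder.fit_transform(df["zipcode"])
print(df)
#   zipcode zip_region  zipcode_encoded
# 0   10001        100                0
# 1   10002        100                1
# 2   94103        941                3
# 3   60601        606                2
```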

XXavier