
I am using TensorFlow, Python, and Pandas to create a logistic regression model similar to this link

Instead of the MNIST dataset, I am using my own dataset. I use Pandas to create the dataframes, replace nulls with the fillna function, and then convert them into a tensor dataset using from_tensor_slices.
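For reference, a minimal sketch of that pipeline (the file name, column names, and batch size here are hypothetical; the dict-of-columns pattern follows the standard tf.data usage for DataFrames):

```python
import pandas as pd
import tensorflow as tf

# Hypothetical file and column names, for illustration only
df = pd.read_csv("my_data.csv")

# Replace nulls: a numeric default for numeric columns,
# a sentinel string for categorical ones
df["income"] = df["income"].fillna(0)
df["zipcode"] = df["zipcode"].fillna("unknown")

# Separate the label, then build a tf.data.Dataset from the frame
labels = df.pop("label")
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels.values))
dataset = dataset.shuffle(buffer_size=len(df)).batch(32)
```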

I have many CATEGORICAL_COLUMNS and I am using get_dummies to do the one-hot encoding (along with label encoding). But the issue is, my categorical columns have large vocabularies (e.g., zip code: I have thousands of zip codes in my data). So when I create columns using get_dummies, the zip code column generates lots of new columns, as sketched below.
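To make the blow-up concrete, here is a small sketch with toy data; with real data the number of zipcode_* columns equals the number of distinct zip codes:

```python
import pandas as pd

df = pd.DataFrame({"zipcode": ["10001", "94103", "60601"],
                   "age": [34, 28, 45]})

# One-hot encode the categorical column; every distinct zip code
# becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["zipcode"])
print(encoded.columns.tolist())
# ['age', 'zipcode_10001', 'zipcode_60601', 'zipcode_94103']

# With thousands of distinct zip codes, this produces thousands of
# mostly-zero columns, one per unique value
```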

Is this a good approach? Is this the way I should approach this kind of dataset?

Relativity

1 Answer


When you have that many distinct categories, creating dummy columns is not going to help. I would suggest grouping the zip codes into counties or some other manageable set of categories. If that does not work, I would just label encode them (`from sklearn.preprocessing import LabelEncoder`) and include the encoded column as a feature, to see if it comes up as an important feature before thinking about more feature engineering. Both options are sketched below.
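A rough sketch of both suggestions; the zip values and the three-digit-prefix grouping are purely illustrative (any domain-appropriate grouping, e.g. a zip-to-county lookup, would do):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"zipcode": ["10001", "10002", "94103", "60601"]})

# Option 1: group zip codes into coarser regions, e.g. by their
# three-digit prefix (a rough geographic area), then one-hot those
df["zip_region"] = df["zipcode"].str[:3]

# Option 2: label-encode the raw zip code into a single integer column
encoder = LabelEncoder()
df["zipcode_encoded"] = encoder.fit_transform(df["zipcode"])
print(df)
#   zipcode zip_region  zipcode_encoded
# 0   10001        100                0
# 1   10002        100                1
# 2   94103        941                3
# 3   60601        606                2
```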

XXavier