
I am not sure what the most effective way is to deal with categorical variables in a regression problem.

My table looks like this:

| Date     | Category | Sales |
|----------|----------|-------|
| 1/1/2018 | Shoes    | 200   |
| 1/2/2018 | Shoes    | 300   |
| 1/1/2018 | home     | 100   |

The problem I am dealing with is sales forecasting.

What is the best way to handle the Category column: dummy variables or a label encoder? I tried a label encoder followed by a standard scaler, but the fit was very poor. After that, I scaled all of my inputs (Date, Category) but left the target variable (Sales) unscaled.

wwnde
lcasucci
  • dummies / label encoder is a common way – Roim May 14 '20 at 07:25
  • One-Hot Encoding is one solution to it. – PraneetNigam May 14 '20 at 07:30
  • You shouldn't use a LabelEncoder. See [LabelEncoder for categorical features?](https://stackoverflow.com/questions/61217713/labelencoder-for-categorical-features/61217936#61217936). You should use either a OneHot encoder, or if the cardinality is huge you may look into bayesian encoders. See [this other answer](https://stackoverflow.com/questions/61585507/how-to-encode-a-categorical-feature-with-high-cardinality/61587769#61587769) – yatu May 14 '20 at 08:21

1 Answer


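A label encoder isn't recommended for input features: it maps each category to an arbitrary integer, so a regression model treats the categories as ordered, evenly spaced values even though no such order exists.

If the cardinality of the column is high, use target encoding. Otherwise, you can try both one-hot encoding and target encoding and compare the results.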
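As a rough illustration of the one-hot option, here is a minimal sketch using `pandas.get_dummies` on the three rows from the question. The DataFrame construction and the name `encoded` are assumptions made for the example, not part of the original post.

```python
import pandas as pd

# The sample rows from the question (dates kept as plain strings here).
df = pd.DataFrame({
    "Date": ["1/1/2018", "1/2/2018", "1/1/2018"],
    "Category": ["Shoes", "Shoes", "home"],
    "Sales": [200, 300, 100],
})

# Replace Category with one 0/1 indicator column per distinct value.
# No ordering between categories is implied, unlike with a label encoder.
encoded = pd.get_dummies(df, columns=["Category"], dtype=int)

print(encoded)
```

The indicator columns are already on a 0/1 scale, so they usually do not need a standard scaler on top.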

Sample notebook using target encoding for time series forecasting: https://www.kaggle.com/avvinci/time-series-forecasting-beginners (cell 21)

More on target encoding: https://maxhalford.github.io/blog/target-encoding/
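For a concrete picture of target encoding, the sketch below uses one common variant, additive smoothing, where each category's mean is pulled toward the global mean. The helper `target_encode`, the `smoothing` parameter, and the column name `Category_te` are hypothetical names chosen for this example; it is not taken from the linked notebook.

```python
import pandas as pd

def target_encode(train, column, target, smoothing=10.0):
    """Return (category -> smoothed target mean, global mean).

    Hypothetical helper: smoothed = (n * mean + m * global_mean) / (n + m),
    where n is the category count and m is the smoothing weight.
    """
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed, global_mean

# The sample rows from the question.
train = pd.DataFrame({
    "Category": ["Shoes", "Shoes", "home"],
    "Sales": [200, 300, 100],
})

mapping, global_mean = target_encode(train, "Category", "Sales", smoothing=1.0)

# Categories not seen when fitting fall back to the global mean.
train["Category_te"] = train["Category"].map(mapping).fillna(global_mean)
```

For forecasting, the mapping should be computed only from rows earlier than the ones being encoded; otherwise the encoded feature leaks the target into the inputs.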

Good tutorial on categorical variables: https://www.coursera.org/learn/competitive-data-science#syllabus (section "Feature Preprocessing and Generation with Respect to Models", 3rd video)

avvinci