
I have three types of categorical data in my dataframe, df:

import pandas as pd

df = pd.DataFrame()
df['Vehicles Owned'] = ['1', '2', '3+', '2', '1', '2', '3+', '2']
df['Sex'] = ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']
df['Income'] = [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535]

What should I do with df['Vehicles Owned']? Should I one-hot encode it, label-encode it, or convert 3+ to an integer and leave it as is? I have used the integer values as they are, but I'm looking for suggestions, since there is an order.

For df['Sex'], should I label-encode it or one-hot encode it? (As there is no order, I have used one-hot encoding.)

df['Income'] has a lot of variation, so should I convert it to bins and one-hot encode them as low, medium, and high income?

– Deshwal
  • Assuming a linear model, do you want owning 2 (or 3) vehicles to have 2x (or 3x) the effect of owning 1 vehicle, or can the relationship be entirely independent? The answer to that determines whether you want dummies or an encoded ordinal variable, though given 3+ I'm guessing dummies. – ALollz Dec 18 '19 at 18:23
  • As a side note, in scikit-learn you should not use `LabelEncoder` to encode `X`; `LabelEncoder` is only for encoding `y`. The `OneHotEncoder` is the way to go. In scikit-learn 0.22, there is a new option, `drop`, which allows you to drop one of the columns, since it is collinear with the rest of the encoded feature. – glemaitre Dec 19 '19 at 10:47
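A minimal sketch of the `OneHotEncoder` usage glemaitre describes (the `drop` option needs a recent scikit-learn, per the comment; the toy data mirrors the question's):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Sex': ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']})

# drop='first' removes one redundant (collinear) column per encoded feature
enc = OneHotEncoder(drop='first')
X = enc.fit_transform(df[['Sex']]).toarray()  # one column: 1 = 'm', 0 = 'f'
print(enc.categories_)  # [array(['f', 'm'], dtype=object)]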

1 Answer


I would recommend:

  • For sex, one-hot encode, which here translates to a single boolean var for is_female or is_male; in general, for n categories you need n-1 one-hot-encoded vars, because the nth is linearly dependent on the first n-1. (All three recommendations are sketched in code after this list.)

  • For vehicles_owned, if you want to preserve order, I would re-map your values from [1, 2, 3, 3+] to [1, 2, 3, 4] and treat it as an int var, or to [1, 2, 3, 3.5] as a float var.

  • For income: you should probably just leave it as a float var. Certain models (like GBT models) will likely do some sort of binning under the hood anyway. If your income data happens to have an exponential distribution, you might try taking its log. But converting it to bins yourself during feature engineering is not what I'd recommend.
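As a minimal sketch of all three recommendations, assuming the question's data and treating the re-map dict and resulting column names as illustrative choices:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Vehicles Owned': ['1', '2', '3+', '2', '1', '2', '3+', '2'],
    'Sex': ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm'],
    'Income': [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
})

# Preserve order in vehicles by re-mapping to integers ('3' included in case it occurs)
df['Vehicles Owned'] = df['Vehicles Owned'].map({'1': 1, '2': 2, '3': 3, '3+': 4})

# One-hot encode sex, dropping one column to avoid collinearity -> keeps only 'Sex_m'
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

# Leave income numeric; log-transform only if its distribution warrants it
df['Income'] = np.log(df['Income'])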

Meta-advice for all of these decisions: set up a cross-validation scheme you're confident in, try different formulations for each feature-engineering choice, and follow your cross-validated performance measure to make the final call.
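Here is a hedged sketch of that workflow; the model, the random stand-in data, and the target are all placeholders, since the question defines no target:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two candidate feature matrices for the same rows, plus a binary target
rng = np.random.default_rng(0)
X_a = rng.normal(size=(100, 3))   # e.g., the ordinal-encoded formulation
X_b = rng.normal(size=(100, 5))   # e.g., the one-hot formulation
y = rng.integers(0, 2, size=100)

for name, X in [('formulation A', X_a), ('formulation B', X_b)]:
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(name, round(scores.mean(), 3))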

Finally, as for which library/function to use, I prefer pandas' get_dummies because it keeps the column names in your final feature matrix informative, like so: https://stackoverflow.com/a/43971156/1870832
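For example (a tiny illustration using the question's Sex column):

import pandas as pd

df = pd.DataFrame({'Sex': ['m', 'f', 'm']})
print(pd.get_dummies(df, columns=['Sex']).columns.tolist())
# ['Sex_f', 'Sex_m'] -- the column names stay readable in the feature matrix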

– Max Power