
I have to use a Decision Tree classifier to classify certain data. However, the attribute values are strings, and as I found here, strings cannot be used as input. Hence I used integer encoding for the strings.

In this article, I found that passing integer-encoded data may produce a wrong result, since sklearn assumes an ordering among the encoded values. So the only way out is to use the OneHotEncoder module.
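To make the concern concrete, here is a minimal sketch comparing the two encodings on the 'price' values from my example. It assumes scikit-learn's `OrdinalEncoder` as the integer encoder (my actual encoding may differ); the point is only that the integer codes carry an order the tree will split on, while the one-hot columns do not.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with the three example values.
X = np.array([['high'], ['med'], ['low']])

# Integer encoding: categories are sorted alphabetically (high, low, med),
# so the codes do not follow the natural low < med < high order.
ordinal = OrdinalEncoder().fit_transform(X)
print(ordinal.ravel())  # codes for high, med, low

# One-hot encoding: one 0/1 column per category, no implied order.
onehot = OneHotEncoder().fit_transform(X).toarray()
print(onehot)
```

A threshold split such as `price <= 0.5` on the integer codes would group 'low' and 'med' together against 'high' purely because of alphabetical sorting.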

Using the OneHotEncoder module increases the number of features. For example, an attribute 'price' with values ['high', 'med', 'low'] becomes three attributes, which can be interpreted as ['price-high', 'price-med', 'price-low'], each taking the value 1 or 0 depending on the data. I don't want this, because I have to print the decision tree in a specific format that requires the original features (e.g. I need 'price').

Is there a way out of this?

Venkatachalam

1 Answer


I think pd.get_dummies would be useful, since it keeps the original feature name as a prefix of each column when creating the one-hot vectors.

Example:

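import pandas as pd
df = pd.DataFrame({'price': ['high', 'medium', 'high', 'low'], 'some_feature': ['b', 'a', 'c', 'a']})
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default dtype is bool
pd.get_dummies(df, columns=['price', 'some_feature'], dtype=int)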

   price_high  price_low  price_medium  some_feature_a  some_feature_b  some_feature_c
0           1          0             0               0               1               0
1           0          0             1               1               0               0
2           1          0             0               0               0               1
3           0          1             0               1               0               0

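When you feed this dataframe to a decision tree, each split is labelled with both the original feature and the value it tests (e.g. price_high), which should give you a better understanding of the tree!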
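A minimal sketch of that step, in case it helps. The labels `y` are made up purely for illustration, and the `'='` separator is my own choice (it assumes no original column name contains `'='`) so that each dummy column can be mapped back to the attribute it came from. This is one possible way to recover names like 'price' for printing, not the only one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    'price': ['high', 'medium', 'high', 'low'],
    'some_feature': ['b', 'a', 'c', 'a'],
})
y = [0, 1, 0, 1]  # hypothetical labels, invented for this illustration

# Use '=' as the separator so columns look like 'price=high'.
X = pd.get_dummies(df, columns=['price', 'some_feature'],
                   prefix_sep='=', dtype=int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the tree with the dummy column names on each split.
print(export_text(clf, feature_names=list(X.columns)))

# Map every dummy column back to its original attribute.
origin = {col: col.split('=', 1)[0] for col in X.columns}

# Original attributes actually used by the tree (leaf nodes have index < 0).
used_originals = {origin[X.columns[i]] for i in clf.tree_.feature if i >= 0}
print(used_originals)
```

With the `origin` mapping you can post-process the printed tree and show 'price' wherever a 'price=...' column is tested, while the tree itself is still trained on the one-hot columns.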

Venkatachalam
  • Sure. That would convert the data to one-hot-encoded form. But, the Decision Tree will be constructed on the new features (e.g. price_high, price_low, etc). So while printing the Decision Tree, the features would not be "price" or "some_feature", but "price_high", "price_low", etc. – Sarthak Chakraborty Mar 10 '19 at 16:42
  • Yes. Why do you want to see just `price` as the feature name when we have already created dummies for it? I think having it as `price_high` explains more about how the split was made in the decision tree. – Venkatachalam Mar 10 '19 at 17:59