I have to use a Decision Tree classifier to classify some data. However, the attribute values are strings, and as I found here, strings cannot be used as input. Hence I used integer encoding for the strings.
In this article, I found out that passing integer-encoded data may give wrong results, since sklearn assumes an ordering among the integer values. So the only way out seems to be using the OneHotEncoder class.
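To illustrate the ordering concern, here is a minimal pure-Python sketch of the kind of integer encoding I mean. The sample column `prices` is made up for illustration, and the integers are assigned in sorted order of the distinct strings, which is an assumption about how the encoding is done rather than my exact code.

```python
# Hypothetical sample column, using the 'price' values from my data.
prices = ["high", "med", "low", "med"]

# Assign one integer per distinct string, in sorted (alphabetical) order.
categories = sorted(set(prices))
to_int = {category: index for index, category in enumerate(categories)}

# Replace every string with its integer code.
encoded = [to_int[p] for p in prices]

print(to_int)   # {'high': 0, 'low': 1, 'med': 2}
print(encoded)  # [0, 2, 1, 2]
```

The resulting order is high < low < med, which has nothing to do with the actual meaning of the values, yet a numeric split such as "price <= 0.5" would treat it as meaningful.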
Using OneHotEncoder increases the number of features. For example, if there is an attribute 'price' with values ['high','med','low'], one-hot encoding replaces it with 3 attributes, which can be interpreted as ['price-high','price-med','price-low'], each taking the value 1 or 0 depending on the data. I don't want this, since I have to print the decision tree in a certain format that requires the original features (e.g. I need 'price').
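To make the feature expansion concrete, here is a small pure-Python sketch of what one-hot encoding does to the 'price' attribute. The helper `one_hot` and the sample `rows` are hypothetical and only for illustration; they are not sklearn's API, just the same idea written out by hand.

```python
# Hypothetical sample rows with the single string attribute 'price'.
rows = [{"price": "high"}, {"price": "low"}, {"price": "med"}]
values = ["high", "med", "low"]


def one_hot(rows, attribute, values):
    """Replace `attribute` with one 0/1 column per possible value."""
    encoded = []
    for row in rows:
        encoded.append(
            {f"{attribute}-{v}": int(row[attribute] == v) for v in values}
        )
    return encoded


encoded_rows = one_hot(rows, "price", values)
feature_names = list(encoded_rows[0])

print(feature_names)    # ['price-high', 'price-med', 'price-low']
print(encoded_rows[0])  # {'price-high': 1, 'price-med': 0, 'price-low': 0}
```

A tree trained on these rows would split on 'price-high', 'price-med' or 'price-low', so 'price' itself never appears as a node, which is exactly what breaks my output format.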
Is there a way out of this?