I'm working on a machine learning (data mining) project. I have finished the data exploration and data preparation steps, both done in Python.

Now I'm facing this issue: I have categorical attributes in my dataset. After some research, I found that the most appropriate algorithm for that kind of data is a decision tree or a random forest classifier.

However, I've read some similar questions about decision trees and categorical attributes and found that the library I'm using (scikit-learn) doesn't work with categorical attributes (see here and here). To make it work, I would need to encode my categorical variables as numerical ones. I don't want to use encoding, because according to this answer I would lose some properties of my attributes and some information. Also, some of my attributes have more than 100 distinct values.

So I want to know:

  • Is there any other Python library that can build decision trees from categorical data without any encoding?
  • In this answer it was suggested that other libraries like WEKA can build decision trees with categorical attributes, so my question is: can I combine two languages in the same machine learning project?

That is, can I do data exploration and preparation in Python, train the model in WEKA (Java), and deploy it in a Python Flask web app? Is that possible?

Espoir Murhabazi

1 Answer

The answer you linked about encoding categorical inputs is just saying that you should avoid numerical encoding when your categories don't have an inherent order. It correctly recommends that you use a one-hot encoding in this case.
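As a rough illustration of that recommendation, here is a minimal sketch of one-hot encoding feeding a scikit-learn decision tree. The toy DataFrame, its column names, and the labels are made up for the example; only the scikit-learn and pandas calls are real.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: "color" and "city" are unordered categories.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "city": ["Goma", "Bukavu", "Goma", "Goma", "Bukavu", "Bukavu"],
    "age": [23, 35, 41, 29, 52, 37],
    "label": [1, 0, 1, 0, 1, 1],
})
X = df[["color", "city", "age"]]
y = df["label"]

# One-hot encode the categorical columns and pass "age" through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color", "city"])],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(X, y)

# Predict for a new row, including a color the encoder has never seen.
new_row = pd.DataFrame({"color": ["yellow"], "city": ["Goma"], "age": [30]})
prediction = model.predict(new_row)
```

Keeping the encoder inside the pipeline means the same encoding is applied at training and prediction time, which matters if the model is later served from a web app.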

Simply put, machine learning models operate on numbers, so even if you find a library that takes your raw categories without explicit encoding, it will still have to internally encode them before it can perform any computation.
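To make that concrete, here is a small pure-Python sketch of what a one-hot encoding amounts to. The function name `one_hot_encode` is hypothetical and this is not how any particular library implements it; it only shows the idea of turning category labels into numbers.

```python
def one_hot_encode(values):
    """Map each distinct category to a 0/1 indicator vector."""
    # Fix an arbitrary but stable position for every category.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot"
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

No order is implied between the categories: each one gets its own column, so "red" is not "greater than" "blue" the way it would be with integer codes.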

100 categories is not a lot, and most off-the-shelf libraries will handle such inputs just fine. I recommend you try xgboost.
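For a sense of scale, here is a hedged sketch with a synthetic column of 120 categories. The data, sizes, and labels are invented, and a scikit-learn random forest stands in for the tree ensemble to keep the example dependency-light; the same encoded matrix could be passed to xgboost's `XGBClassifier` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (an assumption for illustration): one column, 120 categories.
rng = np.random.default_rng(0)
n_rows, n_categories = 2000, 120
codes = rng.integers(0, n_categories, size=n_rows)
codes[:n_categories] = np.arange(n_categories)  # make sure every category occurs
column = np.array([f"cat_{c}" for c in codes]).reshape(-1, 1)
y = (codes % 2 == 0).astype(int)  # made-up label that depends on the category

# The encoder returns a SciPy sparse matrix, so only the single 1 per row
# is stored rather than all 120 columns.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(column)
print(X.shape)  # (2000, 120)
print(X.nnz)    # 2000 stored non-zero entries, one per row

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```

The extra columns are cheap because the matrix is sparse, which is why a few hundred categories is usually not a problem in practice.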

Imran
  • OK, thanks for your response, @Imran. I will try one-hot encoding. It sounds good, but it will increase the dimension of my dataset, so I will have to forget about decision trees and try a stronger classifier like an SVM or a NN. – Espoir Murhabazi Jul 18 '17 at 19:28