How to use RandomForestClassifier with string data

Question

I used sklearn to build a RandomForestClassifier model.

My dataset contains both string data and float data.

I get the error

could not convert string to float

when I run

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_jobs=100)
clf.fit(x1, y1)

How can I build a RandomForest model with mixed data?

Check this: http://stackoverflow.com/questions/24715230/can-sklearn-random-forest-directly-handle-categorical-features — Ulisha, Dec 01 '16 at 14:34

score 3 · Answer 1 · answered Dec 02 '16 at 14:21

It is a scikit-learn convention: estimators accept matrices of numbers, not strings or other data types. This keeps them agnostic to the kind of data: the same estimator can work with tabular data, text, images, etc., as long as the data is encoded numerically first. But it means you need to convert your data (text in your case) to numbers.

There are many ways to convert text to numbers. One of the simplest is called "Bag of Words": each possible word gets its own column, and a document has 1 (or the word count) in that column if the word is present in the document, and 0 otherwise. scikit-learn provides CountVectorizer for that (as well as a few other vectorizers):

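from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# docs: list of raw text strings; y: the matching class labels
vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term count matrix
clf = RandomForestClassifier()
clf.fit(X, y)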
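To make the column layout concrete, here is a small sketch on three made-up documents (the `docs` list is invented for illustration, not taken from the question). With the default settings, CountVectorizer lowercases the text and orders the columns alphabetically by word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
docs = ["red apple", "green apple", "red red car"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; columns are in sorted order
vocab = sorted(vec.vocabulary_)
counts = X.toarray()

print(vocab)   # ['apple', 'car', 'green', 'red']
print(counts)  # rows: [1 0 0 1], [1 0 1 0], [0 1 0 2]
```

Note that the third row has a 2 in the "red" column: by default the vectorizer stores word counts, not just presence.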

See http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html for a complete example and http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction for more information about text vectorization.
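Since the question is about a dataset that mixes string and float columns, one possible approach is to vectorize only the string column and pass the numeric columns through unchanged. The following is a minimal sketch, assuming a recent scikit-learn (ColumnTransformer was added in version 0.20) and pandas; the column names `text` and `weight` and the toy values are hypothetical, and it assumes the string column holds free text. If the strings are really categories, OneHotEncoder could be used in place of CountVectorizer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy mixed dataset, invented for illustration (hypothetical column names)
df = pd.DataFrame({
    "text": ["red apple", "green apple", "red car", "blue car"],
    "weight": [0.2, 0.3, 1200.0, 1500.0],
})
y = ["fruit", "fruit", "vehicle", "vehicle"]

# Bag-of-words for the string column; float columns are passed through as-is.
# A scalar column name ("text", not ["text"]) gives the vectorizer 1-D input.
features = ColumnTransformer(
    [("bow", CountVectorizer(), "text")],
    remainder="passthrough",
)

model = make_pipeline(
    features,
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(df, y)
pred = model.predict(df)
print(pred)
```

Wrapping both steps in a pipeline means the same vocabulary learned during `fit` is reused when predicting on new rows.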
