I have a text dataset in which I have manually classified each record as one of two possible classes. I built a TF-IDF matrix from the corpus (with English stop words removed), trained and tested a Random Forest classifier, evaluated the model, and applied it to a larger corpus of text. All is good so far, but how can I find out more about my model, i.e., which words are "important" to the model?

user1624577

1 Answer

The trained RF has an attribute feature_importances_. It is computed from the fitted trees, so you do not need to pass oob_score=True to the constructor to get it. The feature importances tell you which features (data matrix columns) are influential; they are non-negative and normalized to sum to 1. To get the words, go back to the tfidf vectorizer and read its vocabulary_ attribute (note the trailing underscore), which is a dict mapping words to column indices. Invert that dict to look up the word for each column.

For an explanation of the vocabulary_ attribute, see this post: sklearn : TFIDF Transformer : How to get tf-idf values of given words in document
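Here is a minimal sketch of the steps above, assuming scikit-learn's TfidfVectorizer and RandomForestClassifier. The toy documents, labels, and variable names are made up for illustration and are not from the question; substitute your own fitted vectorizer and classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and labels, invented for illustration only.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = [1, 1, 0, 0]

# Build the TF-IDF matrix without English stop words, as in the question.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# feature_importances_ is available after fit; oob_score is left at its default.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# vocabulary_ maps term -> column index; invert it to map index -> term.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# Pair each column's importance with its term and sort, most important first.
ranked = sorted(
    ((index_to_term[i], imp) for i, imp in enumerate(clf.feature_importances_)),
    key=lambda pair: pair[1],
    reverse=True,
)

for term, importance in ranked[:5]:
    print(f"{term}: {importance:.3f}")
```

Keep in mind that these importances say how much a word contributes to the splits, not which class it points to.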

Dthal