Random forest classifier: result from predict_proba() does not match predict()?

Question

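from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

# ItemSelector is not part of scikit-learn; presumably it is a custom transformer
# that selects one column. Its definition, X, and df are not shown in the question.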
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('Comments', Pipeline([
            ('selector', ItemSelector(column="Comments")),
            ('tfidf', TfidfVectorizer(use_idf=False, ngram_range=(1, 2), max_df=0.95, min_df=0, sublinear_tf=True)),
        ])),
        ('Vendor', Pipeline([
            ('selector', ItemSelector(column="Vendor Name")),
            ('tfidf', TfidfVectorizer(use_idf=False)),
        ])),
    ])),
    ('clf', RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=45)),
    # ('clf', LogisticRegression())
])


X_train, X_test, y_train, y_test = train_test_split(X,
                                df['code Description'],
                                test_size=0.3,
                                train_size=0.7,
                                random_state=100)
model = pipeline.fit(X_train, y_train)
s = pipeline.score(X_test, y_test)
pred = model.predict(X_test)             # predicted class labels
predicted = model.predict_proba(X_test)  # per-class probabilities, columns ordered as model.classes_

For some samples, the class returned by predict() matches the class with the highest predict_proba() score. But in some cases, for example with

proba_predict = [0.3,0.18,0.155]

the sample is classified as class B instead of class A.

Predict class: B

Actual Class : A

The right-hand column contains my labels and the left-hand column contains my input text data.

Can you provide some sample data where this happens? With the provided information we cannot reproduce your results and cannot help. — Merlin1896, May 08 '18 at 19:45
@RafaelC No, I am saying there is a mismatch between the results from predict_proba() and predict(). The class with the maximum predict_proba() value should be my prediction, but the class with the second-highest value is shown as my prediction. — Ayush Agrawal, May 08 '18 at 19:57
Is it worth double checking that `model.classes_` has classes in the same order you anticipate? — Davide Fiocco, May 08 '18 at 20:58
Yes, I am checking it from scratch. According to the source code, it should predict the maximum only. Thanks @Merlin1896 — Ayush Agrawal, May 08 '18 at 21:00

score 1 · Accepted Answer · edited May 08 '18 at 20:47

I think you are describing the following situation: for a test vector from X_test, predict_proba() returns a probability distribution y = [p1, p2, p3] with p1 > p2 and p1 > p3, but predict() does not output class 0 for this vector.

If you inspect the source code of the predict function of sklearn's RandomForestClassifier, you will see that the predict_proba() method of the RandomForest is called there:

proba = self.predict_proba(X)

From these probabilities, the argmax is used to output the class.

Hence, predict() derives its output from predict_proba(). It seems impossible to me that anything goes wrong there.
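This relationship can be checked directly. The following is a minimal sketch on made-up toy data (the feature values, labels, and variable names are illustrative assumptions, not the asker's data): it fits a small RandomForestClassifier and compares predict() with the entry of `classes_` at the argmax of each predict_proba() row.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data, invented for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.array(["A", "B", "C"])[rng.randint(0, 3, size=60)]

clf = RandomForestClassifier(n_estimators=20, random_state=45).fit(X, y)

proba = clf.predict_proba(X)   # one row per sample, one column per class
pred = clf.predict(X)          # class labels

# Column j of proba belongs to clf.classes_[j], so the label with the
# highest probability is found by indexing classes_ with the argmax.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(clf.classes_)
print(np.array_equal(pred, from_proba))
```

If the second printed value were ever False on real data, that would point to a genuine inconsistency; otherwise the apparent mismatch most likely comes from how the columns are being read.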

I would assume that you mixed up some class names in your routine and got confused there. But it is not possible to give a more detailed answer based on the information you provided.
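One way such a mix-up can arise is reading the predict_proba() columns in the wrong order. The sketch below uses no library and only hypothetical labels and probabilities; it assumes, as scikit-learn does for `classes_`, that the classes are stored in sorted order rather than in order of first appearance.

```python
# Hypothetical training labels, in the order they appear in the data.
train_labels = ["refund", "billing", "shipping", "billing", "refund"]

# scikit-learn classifiers store the unique labels in sorted order (classes_);
# sorted(set(...)) imitates that here.
classes_sorted = sorted(set(train_labels))

# A mistaken assumption: columns follow the order of first appearance.
classes_by_appearance = list(dict.fromkeys(train_labels))

# One made-up row of class probabilities.
proba_row = [0.6, 0.3, 0.1]
best = max(range(len(proba_row)), key=proba_row.__getitem__)  # index of the maximum

print("label via sorted classes:  ", classes_sorted[best])
print("label via appearance order:", classes_by_appearance[best])
```

The two lookups disagree even though the probabilities are identical. In the question's setting, comparing `model.classes_` with the assumed column order would reveal this kind of discrepancy.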

answered May 08 '18 at 19:56 by Merlin1896 · edited May 08 '18 at 20:47 by BartoszKP

Hi @Merlin1896, I'm trying to write a wrapper for the random forest regressor. I tried `self.predict_proba = super.predict_proba(X)`, but this gives an error saying super has no attribute predict_proba (the parent class, by the way, is the random forest regressor). – Jeredriq Demas Dec 10 '18 at 00:28
@SamedSivaslıoğlu Please ask a new question with code samples! – Merlin1896 Dec 10 '18 at 06:17
Can I take you here? @Merlin1896 https://stackoverflow.com/questions/53697980/making-random-forest-outputs-like-logistic-regression – Jeredriq Demas Dec 10 '18 at 06:32
