
So far I have consulted another post and the sklearn documentation.

So in general I want to produce the following example:

# model: any sklearn classifier, e.g. RandomForestClassifier()
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array(['A', 'B', 'B', 'C'])   # one label per row of X
Xt = np.array([[11, 22], [22, 33], [33, 44], [44, 55]])
model = model.fit(X, y)
pred = model.predict(Xt)

However, I would like pred to contain three columns per observation:

 A   |  B   |  C
 .5  |  .2  |  .3
 .25 |  .25 |  .5
...

and a different probability for each class showing up in my prediction.

I believe the best approach would be multilabel classification (per the second link I provided above). Additionally, I think it might make sense to use one of the multilabel or multioutput models listed below:

Support multilabel:

    sklearn.tree.DecisionTreeClassifier
    sklearn.tree.ExtraTreeClassifier
    sklearn.ensemble.ExtraTreesClassifier
    sklearn.neighbors.KNeighborsClassifier
    sklearn.neural_network.MLPClassifier
    sklearn.neighbors.RadiusNeighborsClassifier
    sklearn.ensemble.RandomForestClassifier
    sklearn.linear_model.RidgeClassifierCV

Support multiclass-multioutput:

    sklearn.tree.DecisionTreeClassifier
    sklearn.tree.ExtraTreeClassifier
    sklearn.ensemble.ExtraTreesClassifier
    sklearn.neighbors.KNeighborsClassifier
    sklearn.neighbors.RadiusNeighborsClassifier
    sklearn.ensemble.RandomForestClassifier

However, I am looking for someone who has more confidence and experience at doing this the right way. All feedback is appreciated.

-bmc

bmc
  • Could you please clarify what exactly you want as an answer? Basically multilabel is about attaching >= 0 labels from a predefined set of labels to an input example. It might be no labels, 1 label or a bunch of them. As for the probability output for the multiclass case - you can obtain it with the predict_proba function most of the time for all kinds of classifiers. – Maksim Khaitovich Nov 07 '17 at 00:34
  • "However for output, I would like to see 3 columns per observation as output from pred: A | B | C .5 | .2 | .3 .25 | .25 | .5 ..." is the output I'm expecting. Does predict_proba return a probability for each possible label? – bmc Nov 07 '17 at 01:41
  • Yes, it is basically a function which sklearn tries to implement for every multi-class classifier. Some algorithms, though (like SVM, which doesn't naturally provide probability estimates), require you to first instruct the classifier that you want it to estimate class probabilities during training. For instance, for SVM it is SVC(probability=True). Then predict_proba will give you the probabilities for each class. – Maksim Khaitovich Nov 07 '17 at 04:13

2 Answers


From what I understand, you want to obtain the probability of each potential class from a multi-class classifier.

In scikit-learn this can be done with the generic method predict_proba, which is implemented for most classifiers. You basically call:

clf.predict_proba(X)

Here clf is the trained classifier. As output you will get an array of probabilities, one row per input sample and one column per class.
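As a minimal sketch using the question's toy data (the choice of RandomForestClassifier is just illustrative; any classifier with predict_proba works the same way), note that the column order of the returned array follows clf.classes_:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data modeled on the question (4 training rows, 3 classes)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array(['A', 'B', 'B', 'C'])
Xt = np.array([[11, 22], [22, 33], [33, 44], [44, 55]])

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(Xt)

print(clf.classes_)   # column order of proba: ['A' 'B' 'C']
print(proba.shape)    # (4, 3): one row per observation, one column per class
```

Each row of proba sums to 1, giving exactly the A | B | C layout the question asks for.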

One word of caution - not all classifiers naturally estimate class probabilities. For instance, SVM doesn't. You can still obtain the class probabilities, but to do so you must instruct the classifier to perform probability estimation when you construct it. For SVM it would look like:

SVC(probability=True)

After you fit it you will be able to use predict_proba as before.
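A brief sketch of this (the toy data is repeated a few times, an assumption on my part, so that the internal cross-validation libsvm uses for probability estimation has enough samples per class):

```python
import numpy as np
from sklearn.svm import SVC

# enlarged toy set: libsvm fits its probability model via internal CV,
# which needs more than a handful of samples per class
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]] * 3)
y = np.array(['A', 'B', 'B', 'C'] * 3)

# probability=True enables the extra probability-estimation step
clf = SVC(probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba([[3, 4]])
print(proba.shape)  # (1, 3): one probability per class
```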

I need to warn you that if a classifier doesn't naturally estimate probabilities, the probabilities will be obtained through rather expensive computational methods which may significantly increase training time. So I advise you to use classifiers which naturally estimate class probabilities (neural networks with softmax output, logistic regression, gradient boosting, etc.).

Maksim Khaitovich
  • How do you know the order of which label it is giving the probability for? E.g., `y_pred = clf.predict_proba(X_test_tfidf[:len(df_test)])` produces this output `array([[ 0.29354825, 0.08547672, 0.62097503], [ 0.75855171, 0.13965677, 0.10179152], [ 0.39376194, 0.50768248, 0.09855559], ..., [ 0.78636186, 0.0804752 , 0.13316294], [ 0.32583947, 0.06651614, 0.60764439], [ 0.36811811, 0.53192139, 0.0999605 ]])` How do I know what the first, second, and third represent? – bmc Nov 08 '17 at 00:00
  • @bmc use clf.classes_, which will give you the right ordering – Maksim Khaitovich Nov 08 '17 at 04:18

Try using a calibrated model:

from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# define model
model = SVC()
# define and fit calibration model
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(trainX, trainy)
# predict probabilities (drop the [:, 1] slice to keep all class columns)
print(calibrated.predict_proba(testX)[:, 1])
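Since trainX, trainy, and testX are not defined in the snippet above, here is a self-contained variant on synthetic data (make_classification is just a stand-in for your own dataset):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

# synthetic 3-class stand-in for trainX / trainy
X, y = make_classification(n_samples=100, n_classes=3,
                           n_informative=4, random_state=0)

calibrated = CalibratedClassifierCV(SVC(), method='sigmoid', cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)
print(proba.shape)  # (100, 3): one probability column per class
```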
anshuk_pal