from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
import numpy as np

# training documents and their class labels
arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium calcium']
Y = ['animals', 'fruits', 'elements', 'chemicals']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names()  # unused below; newer scikit-learn uses get_feature_names_out()

T = ["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)

# loss='log' enables predict_proba (the loss was renamed 'log_loss' in scikit-learn >= 1.1)
clf = linear_model.SGDClassifier(loss='log')
clf.fit(X, Y)
print(clf.predict(test))
# prints: ['elements']

In the above code, clf.predict() returns only the single best prediction for each sample. I am interested in the top 3 predictions for a particular sample. I know that predict_proba/predict_log_proba returns a list of probabilities, one per class in Y, but it has to be sorted and then associated with the class labels before getting the top 3 results. Is there a direct and efficient way?
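For context, the manual route described above might look something like this sketch (clf.classes_ holds the labels in the same order that predict_proba reports its columns):

probs = clf.predict_proba(test)[0]
top3 = sorted(zip(clf.classes_, probs), key=lambda t: t[1], reverse=True)[:3]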


6 Answers


There is no built-in function, but what is wrong with

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[-n:]

?

As suggested in one of the comments, [-n:] should be changed to [:,-n:] so that the slice is taken per row rather than over rows:

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:,-n:]
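Since argsort returns indices in ascending order, the best class sits in the last column. A small sketch mapping those indices back to labels via clf.classes_ (assuming clf is the fitted classifier from the question):

n = 3
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:, -n:]
top_labels = clf.classes_[best_n[:, ::-1]]  # reverse so the best class comes first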

I know this has been answered...but I can add a bit more...

# preds and truths are both m-by-n arrays (m samples, n classes);
# truths is one-hot encoded, so np.argmax recovers each true class index
def top_n_accuracy(preds, truths, n):
    best_n = np.argsort(preds, axis=1)[:, -n:]
    ts = np.argmax(truths, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i, :]:
            successes += 1
    return float(successes) / ts.shape[0]

It's quick and dirty, but I find it useful. One can add their own error checking, etc.

  • @user1269942: Very helpful addition there! However, I did not totally understand the function of the "truths" variable. Can you elaborate please? – Statmonger Feb 24 '19 at 12:34
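To the comment above: truths is expected to be one-hot encoded, so np.argmax recovers each sample's true class index. A hypothetical example with made-up arrays:

preds = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3]])
truths = np.array([[0, 0, 1],   # true class: 2
                   [1, 0, 0]])  # true class: 0
print(top_n_accuracy(preds, truths, 2))  # 1.0: both true classes land in the top 2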

Hopefully, Andreas will help with this. predict_proba is not available when loss='hinge'. To get the top n classes when loss='hinge', do:

from sklearn.calibration import CalibratedClassifierCV

# clfSDG is the SGDClassifier from the question, left unfitted here
calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
model = calibrated_clf.fit(train.data, train.label)

probs = model.predict_proba(test_data)
sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda x: x[1])[-n:]

Not sure if clfSDG.predict and calibrated_clf.predict will always predict the same class.

  • Given that log-loss for SGDClassifier is OvR, you could also just rank by decision_function and it wouldn't really be worse in any way. Using CalibratedClassifierCV is probably better, but orthogonal to the question. I would use LogisticRegression(multi_class='multinomial'). – Andreas Mueller Aug 09 '17 at 20:07
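A minimal sketch of the decision_function ranking the comment suggests (the scores are uncalibrated margins, not probabilities; clfSDG is assumed to be the hinge-loss classifier, already fitted):

n = 3
scores = clfSDG.decision_function(test_data)             # shape (n_samples, n_classes)
top_n_idx = np.argsort(scores, axis=1)[:, -n:][:, ::-1]  # best class first
top_n_labels = clfSDG.classes_[top_n_idx]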

argsort gives results in ascending order; if you want to save yourself from unusual loops or confusion, you can use a simple trick.

probs = clf.predict_proba(test)
best_n = np.argsort(-probs, axis=1)[:, :n]

Negating the probabilities turns the smallest values into the largest, so you can take the top-n results directly in descending order.
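If you also want the matching probabilities, np.take_along_axis (NumPy >= 1.15) pairs each row of indices with its values; a small follow-up sketch, not part of the original answer:

top_labels = clf.classes_[best_n]                      # shape (n_samples, n)
top_probs = np.take_along_axis(probs, best_n, axis=1)  # probabilities, descending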


As @FredFoo described in "How do I get indices of N maximum values in a NumPy array?", a faster method is to use argpartition.

Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do

>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> a[ind]
array([4, 9, 6, 9])

Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:

>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])

To get the top-k elements in sorted order in this way takes O(n + k log k) time.
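Applied to the classifier output, the same idea ranks each row of predict_proba without a full sort; a sketch, assuming probs and n as in the earlier answers:

ind = np.argpartition(probs, -n, axis=1)[:, -n:]  # top-n indices per row, unsorted
order = np.argsort(np.take_along_axis(probs, ind, axis=1), axis=1)[:, ::-1]
top_n = np.take_along_axis(ind, order, axis=1)    # sorted, best first
top_labels = clf.classes_[top_n]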


I wrote a function that outputs a dataframe with the top n predictions and their probabilities, and ties it back to class names. Hope this is helpful!

import numpy as np
import pandas as pd

def return_top_n_pred_prob_df(n, model, X_test, column_name):
    predictions = model.predict_proba(X_test)
    preds_idx = np.argsort(-predictions)  # class indices per sample, best first
    classes = pd.DataFrame(model.classes_, columns=['class_name'])
    classes.reset_index(inplace=True)
    top_n_preds = pd.DataFrame()
    for i in range(n):
        top_n_preds[column_name + '_prediction_{}_num'.format(i)] = [preds_idx[doc][i] for doc in range(len(X_test))]
        top_n_preds[column_name + '_prediction_{}_probability'.format(i)] = [predictions[doc][preds_idx[doc][i]] for doc in range(len(X_test))]
        # join the class index back to its name, then tidy up the helper columns
        top_n_preds = top_n_preds.merge(classes, how='left', left_on=column_name + '_prediction_{}_num'.format(i), right_on='index')
        top_n_preds = top_n_preds.rename(columns={'class_name': column_name + '_prediction_{}'.format(i)})
        try:
            top_n_preds.drop(columns=['index', column_name + '_prediction_{}_num'.format(i)], inplace=True)
        except KeyError:
            pass
    return top_n_preds
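A hypothetical call, reusing the fitted clf from the question (test is densified here because the function calls len() on X_test, which sparse matrices do not support):

df = return_top_n_pred_prob_df(3, clf, test.toarray(), 'topic')
print(df.head())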