
I'm trying to do multilabel classification with an SVM. I have nearly 8k features and a y matrix with about 400 labels per sample. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(), but when I do use it on the raw form of my Y data, it still gives the same result.

I'm running this code:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")

# First 2600 rows for training, one held-out row for testing
training_X = X[:2600, :]
training_y = Y[:2600, :]

test_sample = X[2600:2601, :]
test_result = Y[2600:2601, :]

classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)
print(test_result)

After the whole fitting process, when it comes to the prediction part it warns Label not x is present in all training examples (where x is a few different numbers in the range of my y vector length, 400). After that, it outputs a predicted y vector that is always the zero vector of length 400 (the y vector length). I'm new to scikit-learn and to machine learning in general, and I can't figure out the problem here. What is the problem, and what should I do to fix it? Thanks.

malisit

1 Answer


There are two problems here:

1) The missing-label warning
2) All of your predictions are 0

The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
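As a small illustration of that check (using a toy label matrix, not the asker's data), you can also locate exactly which classes never occur:

```python
import numpy as np

# Toy multilabel matrix: 5 samples, 4 classes; class 3 never occurs
Y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])

print(Y.sum(axis=0).all())                    # False: some class has zero examples
print(np.where(Y.sum(axis=0) == 0)[0])        # [3]: the index of the missing class
```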

As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
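On a toy label matrix, the frequency check looks like this:

```python
import numpy as np

# 4 samples, 3 classes; mean over axis 0 gives each class's frequency
Y = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 0, 1]])

freq = Y.mean(axis=0)
print(freq)        # [0.5  0.   0.75]
print(freq.max())  # 0.75 -- the most common label's frequency
```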

If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
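A rough sketch of that idea, on synthetic stand-in data (the shapes and label frequencies here are made up for illustration, not taken from the question):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 200 samples, 10 features, 5 rare labels
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
Y = (rng.rand(200, 5) < 0.15).astype(int)

clf = OneVsRestClassifier(SVC(kernel='rbf'))
clf.fit(X, Y)

# decision_function gives one score per (instance, class);
# argmax picks the single most confident class for each instance
dv = clf.decision_function(X[:3])
pred = np.zeros_like(dv, dtype=int)
pred[np.arange(dv.shape[0]), dv.argmax(axis=1)] = 1
print(pred.sum(axis=1))  # every instance now gets exactly one positive label
```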

I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
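That "above a threshold, plus the single best class" rule can be sketched like this (`predict_with_threshold` is a hypothetical helper, and the threshold value is just for illustration; in practice you would tune it on a holdout set):

```python
import numpy as np

def predict_with_threshold(dv, threshold):
    """Label = 1 wherever the decision value exceeds the threshold,
    plus always the single highest-scoring class per instance."""
    pred = (dv > threshold).astype(int)
    pred[np.arange(dv.shape[0]), dv.argmax(axis=1)] = 1
    return pred

# Two instances, three classes; all decision values negative,
# so plain 0/1 prediction would return all zeros
dv = np.array([[-1.2, -0.3, -2.0],
               [-0.1, -0.9, -0.4]])
print(predict_with_threshold(dv, -0.5))  # [[0 1 0]
                                         #  [1 0 1]]
```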

If you made it this far, thanks for reading. Hope that helps.

Dthal
  • Hi, thanks for an answer with lots of useful stuff. I tried `Y.sum(axis=0).all()` and it returned True. Also, I tried `Y.mean(axis=0).max()` and it returned `0.315981070258`. Should I still use `clf.decision_function`? Can you be more specific about how to implement it? I'm sorry, I'm very new at this stuff, so I couldn't understand what to do with `decision_function`. – malisit Jan 02 '16 at 01:46
  • I'm saying that if you are getting all zero predictions, and you know that there should be some 1's in there, you could try getting decision values instead, and predicting 1 whenever that is above some threshold. Your predicted labels would be: `(decision_value > threshold).astype(float)`. The threshold will be less than 0, because 0 is the threshold that the classifier is using and it's not getting any positives. Alternatively, if you know that there is at least one positive label per instance, you could pick the label that has the highest DV (it will still be negative). – Dthal Jan 02 '16 at 02:09
  • Thanks! The intuition and the code you provided on GitHub really helped. – malisit Jan 02 '16 at 16:31
  • @Dthal why use the decision function and not the prediction probabilities? I think prediction probabilities will give us better accuracy by choosing the label corresponding to the class with the highest probability. – sandyp Aug 24 '17 at 04:14
  • @sandyp- In the sklearn docs (http://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities), it says that Platt scaling is used to get the probabilities. That is a kludge as noted there. The docs also say: "If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.". Usually decision_function and predict_proba are monotonically related, so if you need a prediction of the best label, you can use either. – Dthal Aug 24 '17 at 22:16