
Use case:

I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC to classify them with an rbf kernel. I need the confidence of the prediction along with the predicted class, so I used the predict_proba method of SVC. I was getting weird results with it. I searched a bit and found out that it only makes sense for larger datasets.
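
Roughly what I am doing, with made-up numbers standing in for my data:

    import numpy as np
    from sklearn.svm import SVC

    # Toy stand-in for my data: a handful of samples per class (numbers are made up)
    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # class 0
                  [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]])  # class 1
    y = np.array([0, 0, 0, 1, 1, 1])

    # probability=True enables Platt scaling so predict_proba is available
    clf = SVC(kernel='rbf', probability=True)
    clf.fit(X, y)

    print(clf.predict([[0.12, 0.22]]))        # predicted class
    print(clf.predict_proba([[0.12, 0.22]]))  # per-class "confidence" -- weird on tiny datasets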

Found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.

The author of that question verified this by duplicating the dataset, i.e. repeating every sample several times.

My questions:

1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will this have? Overfitting? (See the sketch after this list for what I mean by multiplying.)

2) Is there any other way I can calculate the confidence of the classifier, such as the distance from the hyperplane?

3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
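
This is what I mean by multiplying the dataset in (1) (X and y are again just toy placeholders):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    y = np.array([0, 0, 1, 1])

    # Repeat every sample 100 times before fitting
    X_big = np.repeat(X, 100, axis=0)
    y_big = np.repeat(y, 100)

    clf = SVC(kernel='rbf', probability=True).fit(X_big, y_big)
    print(clf.predict_proba([[0.15, 0.15]]))  # does this become more "correct", and at what cost?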

  • What do you mean by "confidence?" Anyway, with only 3 samples, there is not much to hope for in anything you choose. – juanpa.arrivillaga Dec 14 '16 at 05:36
  • @juanpa.arrivillaga How confident is the classifier that this sample belongs to this class. Platt scaling or distance from the hyperplane? – Ishan Jain Dec 14 '16 at 05:43
  • As @juanpa said - with 3 samples there is nothing reasonable to do, really. In particular SVM makes no sense (and 99% of other statistical methods). You can use 1-NN, which is simply a rule of "attach a label of the closest point", but again - 3 samples per class is way too small for any decent analysis. Unless you have tens of thousands of classes, and there is a structure in between them. – lejlot Dec 14 '16 at 23:07
  • @lejlot I agree that the sample size is not good; if it were up to me I would have increased the dataset. But we have made a service for brands where we classify the intent of a statement based on the examples they provide. Users will only enter about this many examples at first. It may slowly increase, but initially I do not expect a lot of samples for training. What if I take each sample 50 times and train on that? What will be the side effects? I am sorry for such a dumb question. – Ishan Jain Dec 15 '16 at 05:35
  • Duplicating samples does *nothing*. – lejlot Dec 15 '16 at 09:09
  • For estimating the confidence interval you could also use bootstrapping (see https://stats.stackexchange.com/a/94855/141373 ). – rth Dec 17 '16 at 13:21

1 Answer


First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.

SVMs are mainly popular in high-dimensional settings; it is currently unclear whether that applies to your project. They build separating planes on a handful of (or even single) supporting instances, and in situations with large training sets they are often outperformed by neural nets. A priori they might not be your worst choice.

Oversampling your data will do little for an approach using an SVM. An SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the training set as the test set).

Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique), which basically generates synthetic instances based on the ones you have. In theory this will provide you with new instances that are not exact copies of the ones you have, and which might thus fall a little outside the normal classification. Note: by definition all these synthetic examples will lie in between the original examples in your sample space. That does not mean they will lie in between them in your projected SVM space, so the classifier may learn effects that are not really true.
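
A rough sketch of how SMOTE could be applied here, assuming the imbalanced-learn package is available (exact parameter names may differ between its versions); with classes this small the k_neighbors parameter has to be lowered below the class size:

    import numpy as np
    from sklearn.svm import SVC
    from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

    # Tiny illustrative dataset: 3 samples per class (numbers are made up)
    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
                  [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Ask SMOTE to grow each class to 50 samples;
    # k_neighbors must be smaller than the class size, so 2 instead of the default 5
    smote = SMOTE(sampling_strategy={0: 50, 1: 50}, k_neighbors=2, random_state=0)
    X_res, y_res = smote.fit_resample(X, y)

    clf = SVC(kernel='rbf', probability=True).fit(X_res, y_res)
    print(clf.predict_proba([[0.5, 0.5]]))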

Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
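
In scikit-learn that score is exposed through decision_function, which returns a signed value proportional to the distance from the decision boundary rather than a calibrated probability (the data below is again just a toy illustration):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    y = np.array([0, 0, 1, 1])

    clf = SVC(kernel='rbf').fit(X, y)  # no probability=True needed for this

    # Signed score relative to the separating hyperplane (in kernel space);
    # a larger magnitude means the sample lies further from the decision boundary
    print(clf.decision_function([[0.12, 0.18], [0.5, 0.5], [0.88, 0.85]]))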
