
Basically, I am using some data mining algorithms from the Python scikit-learn library to do classification.

However, I am getting some very unbalanced results: around 0.99 recall and less than 0.1 precision.

Conceptually, classification algorithms rely on some "threshold" to make the decision, which means I should be able to balance precision and recall simply by adjusting this "threshold".

However, I cannot find any API in sklearn to help with this, so my question is: how can I manipulate the underlying "threshold" inside the sklearn library to balance precision and recall?

lllllllllllll
  • I don't know about this kind of threshold in `sklearn`, but before I start searching for it, could you tell me whether your data is **imbalanced**? I just want to be sure you don't have something totally imbalanced, like a `90:1` class proportion (for every 90 samples in class A, only 1 belongs to class B, for example) – Guiem Bosch Feb 16 '16 at 18:41
  • @Guiem Thank you, my samples are 50:50. – lllllllllllll Feb 16 '16 at 18:51

1 Answer


OK, if your problem is not about unbalanced data, I must refer you to some notes I learned in Andrew Ng's Machine Learning course: http://www.holehouse.org/mlclass/06_Logistic_Regression.html

I chose Logistic Regression here because I don't really know which methods you are using. But the conclusion is basically that a threshold is not an explicit parameter of your learner model. I mean, you can choose afterwards where you are going to cut the classification (in probabilistic models), or you can establish some weighting parameters in other methods (check this answer: scikit .predict() default threshold).
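
For the weighting route, here is a minimal sketch (the synthetic dataset and the 5:1 weights are illustrative assumptions, not from the question). Penalizing mistakes on one class more heavily via `class_weight` shifts the decision boundary in much the same way that moving a threshold would:

```python
# Sketch: re-weighting classes instead of moving an explicit threshold.
# The dataset and the 5:1 weighting below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Make errors on class 0 five times as costly as errors on class 1:
# the learner becomes more reluctant to predict class 1, which should
# raise precision on class 1 at the expense of recall.
clf = SVC(class_weight={0: 5, 1: 1})
clf.fit(X, y)
y_pred = clf.predict(X)
```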

These thresholds only account for the proportion of false positives/false negatives (precision/recall) and shouldn't strictly be considered parameters of the learning algorithm.
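
To make that trade-off concrete, a sketch with synthetic data (the model and split are placeholders, not your setup): `sklearn.metrics.precision_recall_curve` gives the precision/recall pair you would get at every candidate threshold, so you can pick the balance you want.

```python
# Sketch: enumerate the precision/recall trade-off across thresholds.
# Synthetic data; the import path is sklearn.model_selection in newer versions.
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # P(class == 1) for each test sample

# One (precision, recall) pair per threshold; choose the cut-off you prefer.
precision, recall, thresholds = precision_recall_curve(y_test, probs)
```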

Side note: in a specific classification problem I found 'empirically' that I needed a probability of at least 0.6 to be right, so I used the classifier's `predict_proba` method instead of `predict`, and it was me who finally decided the returned class. Don't know if that helps.
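
In code, that side note looks roughly like this (a sketch on synthetic data; the 0.6 cut-off is just the empirical value mentioned above and would need tuning on your own data):

```python
# Sketch: decide the class yourself from predict_proba instead of predict.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X)[:, 1]      # column 1 = positive-class probability
y_pred = np.where(probs >= 0.6, 1, 0)   # predict 1 only when fairly confident
```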

Guiem Bosch
  • Hello Guiem, thank you for your response. What about other mining methods, then? I find that not all of them have the `class_prior` parameter. How about `decisiontree` or `svm`? – lllllllllllll Feb 16 '16 at 20:06
  • Yeah, I know; those methods usually have `class_weight`, which you could try playing with, btw. I mean, don't set it to "balanced", because we already know your data is balanced. But since you say you have low precision, you should focus on the false positives. – Guiem Bosch Feb 16 '16 at 20:26
  • And as I told you before, you can output the prediction probabilities (`predict_proba`). So imagine I'm on a typical classification problem: 'is there a human face in this picture?'. Low precision implies a high rate of false positives: lots of cases where I say 'yes, there is a face in this picture' but there actually isn't. So every time you are going to say 'yes', you could check the probability of saying yes, and if it's not above 0.7 (just to say something), you could omit the positive classification. – Guiem Bosch Feb 16 '16 at 20:29
  • btw, another possible issue: you say your ratio is 50:50, but is this ratio maintained in your training set? I mean, just imagine you don't split wisely and the proportion is not kept in the training and test sets. If that's the case, `sklearn.cross_validation.train_test_split()` would be a solution to split your data wisely; see the sketch below. – Guiem Bosch Feb 16 '16 at 20:33
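
A minimal sketch of that last suggestion (synthetic data; the `stratify` argument, available in recent sklearn versions, keeps the class proportions identical in both splits, while a plain random split might not):

```python
# Sketch: a stratified split so the 50:50 class ratio survives the split.
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```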