
I’m using the Visual Recognition service on IBM Bluemix.

I have created several classifiers, two of them in particular with the following objectives:

  • first: a “generic” classifier that returns a confidence score for the recognition of a particular object in an image. I’ve trained it with 50 positive examples of the object and 50 negative examples of things similar to it (details of it, its components, images that resemble it, etc.).
  • second: a more specific classifier that recognizes the particular type of the object identified by the first, provided the score of the first classification is high enough. This classifier has been trained like the first one: 50 positive examples of type A objects, 50 negative examples of type B objects. This second categorization should be more specific than the first, because the images are more detailed and all similar to one another. (A sketch of this two-step pipeline follows the list.)
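
For illustration, the pipeline I have in mind looks roughly like the sketch below. `classify_image` is a hypothetical placeholder for the actual call to the Visual Recognition classify endpoint, and the classifier IDs, class names and 0.5 cut-off are illustrative values, not my real configuration.

```python
# Minimal sketch of the two-step pipeline described above.
# classify_image() is a hypothetical helper that calls the Visual
# Recognition classify endpoint and returns a {class_name: score} dict.

GENERIC_CLASSIFIER_ID = "my_generic_classifier"    # placeholder ID
SPECIFIC_CLASSIFIER_ID = "my_specific_classifier"  # placeholder ID
GENERIC_THRESHOLD = 0.5                            # cut-off to tune


def classify_image(image_path, classifier_id):
    """Hypothetical wrapper around the service's classify call."""
    raise NotImplementedError("call the Visual Recognition API here")


def recognize(image_path):
    # Step 1: does the image contain the object at all?
    generic_scores = classify_image(image_path, GENERIC_CLASSIFIER_ID)
    object_score = generic_scores.get("object", 0.0)
    if object_score < GENERIC_THRESHOLD:
        return {"object_detected": False, "object_score": object_score}

    # Step 2: if it does, which specific type (A or B) is it?
    specific_scores = classify_image(image_path, SPECIFIC_CLASSIFIER_ID)
    best_type = max(specific_scores, key=specific_scores.get)
    return {
        "object_detected": True,
        "object_score": object_score,
        "type": best_type,
        "type_score": specific_scores[best_type],
    }
```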

The result is that both classifiers work well: on a particular set of test images, the classification matches the expected result in most cases, which should mean that both have been trained properly.

But there is one thing I don’t understand.

With both classifiers, if I classify one of the images that was used in the positive training set, I expect the confidence score to be near 90-100%. Instead, I always get a score between 0.50 and 0.55. The same thing happens when I try an image very similar to one from the positive training set (scaled, reflected, cropped, etc.): the confidence never goes above roughly 0.55.

I’ve also tried creating a similar classifier with 100 positive images and 100 negative images, but the result never changes.

The question is: why is the confidence score so low? Why isn’t it near 90-100% even for images used in the positive training set?

Dieghitus

1 Answer


The scores from Visual Recognition custom classifiers range from 0.0 to 1.0, but they are unitless and are not percentages or probabilities (they do not add up to 100% or 1.0).

When the service creates a classifier from your examples, it is trying to figure out what distinguishes the features of one class of positive_examples from the other classes of positive_examples (and negative_examples, if given). The scores are based on the distance to a decision boundary between the positive examples for the class and everything else in the classifier. It attempts to calibrate the score output for each class so that 0.5 is a decent decision threshold, to say whether something belongs to the class.
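
As a purely illustrative toy model (not the service’s actual calibration), you can picture such a score as a signed distance from the decision boundary squashed through a logistic function, so that examples sitting exactly on the boundary come out at 0.5 and examples far inside the class region approach 1.0:

```python
import math

def toy_score(signed_distance_to_boundary, scale=1.0):
    """Toy illustration only: map a signed margin to a 0.0-1.0 score.

    Positive distances (inside the class region) push the score above
    0.5, negative distances push it below; the boundary itself is 0.5.
    This is NOT the service's real scoring, just the intuition.
    """
    return 1.0 / (1.0 + math.exp(-scale * signed_distance_to_boundary))

print(toy_score(0.0))   # 0.5   -> exactly on the boundary
print(toy_score(0.1))   # ~0.52 -> barely inside the class region
print(toy_score(3.0))   # ~0.95 -> far from the boundary
```

In this picture, a score of 0.50-0.55 simply means the image lands very close to the boundary, not that the classifier failed outright.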

However, given the cost-benefit balance of false alarms vs. missed detections in your application, you may want to use a higher or lower threshold for deciding whether an image belongs to a class.
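
For example, applying your own threshold is just a comparison against the returned scores. In the sketch below, `scores` is assumed to be a plain mapping of class name to score extracted from the classify response; the class names and numbers are made up for illustration.

```python
# Sketch: decide class membership with an application-specific threshold
# instead of the default 0.5. `scores` is assumed to be a mapping of
# class name -> score pulled out of the classify response.

def accepted_classes(scores, threshold):
    """Return the classes whose score clears the chosen threshold."""
    return {name: s for name, s in scores.items() if s >= threshold}

scores = {"object": 0.54, "not_object": 0.31}    # example values only

# Missed detections are costly -> be permissive (lower threshold).
print(accepted_classes(scores, threshold=0.45))  # {'object': 0.54}

# False alarms are costly -> be strict (higher threshold).
print(accepted_classes(scores, threshold=0.60))  # {}
```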

Without knowing the specifics of your class examples, I might guess that there is a significant amount of similarity between your classes, that maybe in the feature space your examples are not in distinct clusters, and that the scores reflect this closeness to the boundary.

Matt Hill
  • Thanks! I still have a problem: my app is for a customer who has to be informed about the confidence level of his search through the service. Since the app should hide all the inner technical details related to the distance from the boundary between decision regions (at least for a normal user), I wonder how the returned dimensionless number could be interpreted and somehow transformed into something useful for a “common man”. I need to extract a parameter (a percentage would be ideal) to let the user understand the usefulness of his search in a simple, direct way. How could I do that? – Dieghitus Aug 02 '16 at 16:45
  • There are several ways - here's one: 1. Assemble a set of labeled data "L" that was not used in training the classifier. 2. Split L into 2 sets, V and T - validation and testing. 3. Run V through your classifier and pick a score threshold "R" which optimizes the correctness metric you value, such as top-5 precision, across all of V. 4. From T, select a random subset "Q" and classify it using your classifier and "R". Compute the probability of a correct classification on Q. That's 1 experiment. 5. Repeat #4 with a different Q from T and compute the average % correct across all experiments (a code sketch of this procedure follows the comment thread). – Matt Hill Aug 03 '16 at 18:43
  • Sorry for the delay. I've done what you suggested and I think the final result of the validation and testing process is quite good. Thank you! – Dieghitus Aug 10 '16 at 08:53
  • Great! Glad to hear it. – Matt Hill Aug 11 '16 at 15:58
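
For reference, here is a rough sketch of the procedure outlined in the comments above, assuming a hypothetical `classify_image` helper that returns a {class: score} mapping and a labeled dataset of (image_path, true_label) pairs that were not used for training. The resulting average % correct is the kind of figure that could be shown to end users instead of the raw score.

```python
import random


def classify_image(image_path, classifier_id):
    """Hypothetical wrapper around the classify call; returns {class: score}."""
    raise NotImplementedError("call the Visual Recognition API here")


def is_correct(image_path, true_label, classifier_id, threshold):
    """Count an image as correct if its true class is the top score and clears the threshold."""
    scores = classify_image(image_path, classifier_id)
    best = max(scores, key=scores.get)
    return best == true_label and scores[best] >= threshold


def pick_threshold(validation_set, classifier_id,
                   candidates=(0.45, 0.50, 0.55, 0.60)):
    """Step 3: choose the threshold R that maximizes accuracy on V."""
    def accuracy(t):
        hits = sum(is_correct(img, lbl, classifier_id, t)
                   for img, lbl in validation_set)
        return hits / len(validation_set)
    return max(candidates, key=accuracy)


def estimate_percent_correct(test_set, classifier_id, threshold,
                             experiments=10, subset_size=20):
    """Steps 4-5: repeatedly sample Q from T and average the % correct."""
    results = []
    for _ in range(experiments):
        q = random.sample(test_set, min(subset_size, len(test_set)))
        hits = sum(is_correct(img, lbl, classifier_id, threshold)
                   for img, lbl in q)
        results.append(hits / len(q))
    return 100.0 * sum(results) / len(results)


# Usage sketch: labeled_data is a list of (image_path, true_label) pairs
# that were NOT used to train the classifier.
#   random.shuffle(labeled_data)
#   half = len(labeled_data) // 2
#   validation, test = labeled_data[:half], labeled_data[half:]
#   r = pick_threshold(validation, "my_classifier_id")
#   print(estimate_percent_correct(test, "my_classifier_id", r))
```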