
I have read the original Viola-Jones article, the Wikipedia article, the OpenCV manual and these SO answers:

How does the Viola-Jones face detection method work?

Defining an (initial) set of Haar Like Features

I am trying to implement my own version of the two detectors in the original article (the AdaBoost-200-features version and the final cascade version), but something is missing from my understanding, specifically how the 24 x 24 detector works on the entire image and (maybe) on its sub-images at (maybe) different scales. To my understanding, during detection:

(1) The integral image is computed twice for image variance normalization: once over the image as is, and once over the squared image:

The variance of an image sub-window can be computed quickly using a pair of integral images.
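To make this concrete, here is a rough sketch (my own Python, not the paper's code) of how I picture these two integral images and the per-window variance, using variance = E[x²] − E[x]²:

```python
import numpy as np

def integral_images(img):
    """Plain integral image and integral of the squared image."""
    img = img.astype(np.float64)
    ii = img.cumsum(axis=0).cumsum(axis=1)
    ii_sq = (img ** 2).cumsum(axis=0).cumsum(axis=1)
    return ii, ii_sq

def window_sum(ii, y, x, h, w):
    """Sum over the window [y, y+h) x [x, x+w) with 4 table lookups."""
    a = ii[y + h - 1, x + w - 1]
    b = ii[y - 1, x + w - 1] if y > 0 else 0.0
    c = ii[y + h - 1, x - 1] if x > 0 else 0.0
    d = ii[y - 1, x - 1] if (y > 0 and x > 0) else 0.0
    return a - b - c + d

def window_variance(ii, ii_sq, y, x, h, w):
    """Variance of a sub-window: E[x^2] - E[x]^2, both from integral images."""
    n = h * w
    mean = window_sum(ii, y, x, h, w) / n
    mean_sq = window_sum(ii_sq, y, x, h, w) / n
    return mean_sq - mean ** 2
```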

(2) The 24 x 24 square detector is moved across the normalized image in steps of 1 pixel, deciding for each square whether it is a face (or a different object) or not:

The final detector is scanned across the image at multiple scales and locations.
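My current mental model of this scan, as a sketch reusing the helpers above (classify_window is a hypothetical placeholder for the trained detector, e.g. the 200-feature AdaBoost classifier, which I assume would divide its feature values by sigma for variance normalization):

```python
def scan_image(ii, ii_sq, classify_window, win=24, step=1):
    """Slide a win x win window over the image in 1-pixel steps."""
    H, W = ii.shape
    detections = []
    for y in range(0, H - win + 1, step):
        for x in range(0, W - win + 1, step):
            var = max(0.0, window_variance(ii, ii_sq, y, x, win, win))
            sigma = var ** 0.5
            # classify_window is a stand-in for the trained strong classifier;
            # it gets sigma for per-window variance normalization.
            if classify_window(ii, y, x, win, sigma):
                detections.append((x, y, win))
    return detections
```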

(3) Then the image is scaled down by a factor of 1.25, and we go back to (1).

This is done 12 times, until the smaller side of the image rectangle is about 24 pixels long (288 in the original image divided by 1.25^(12 − 1) is just over 24):

a 384 by 288 pixel image is scanned at 12 scales each a factor of 1.25 larger than the last.
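For completeness, the arithmetic behind the 12-scale figure (my own check, not from the paper):

```python
import math

# 288 / 24 = 12, so the number of extra 1.25x steps that keep the short
# side at >= 24 pixels is floor(log_1.25(12)) = 11, i.e. 12 scales in total.
num_scales = math.floor(math.log(288 / 24, 1.25)) + 1
print(num_scales)          # 12
print(288 / 1.25 ** 11)    # ~24.7, "just over 24" as stated above
```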

But then I see this quote in the article:

Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good detection results were obtained using scales which are a factor of 1.25 apart.

And I see this quote in the Wikipedia article:

Instead of scaling the image itself (e.g. pyramid-filters), we scale the features.

And I'm thrown off: I no longer understand what exactly is going on. How can the 24 x 24 detector be scaled? The entire set of features calculated on this area (whether in the 200-features version or the ~6K-features cascade version) is based on originally exploring the 162K possible features of this 24 x 24 rectangle. And why did I get the impression that the pyramid paradigm still holds for this algorithm?

What should change in the above algorithm? Is it the image that is scaled or the 24 x 24 detector, and if it is the detector, how exactly is that done?

Giora Simchoni
  • Scaling the filter up or scaling the image down should give the same result (in theory). For many algorithms it is desirable to reduce the image size and keep the filter small, because it is faster (increasing a 2D filter's size strongly increases computation time). But if the filter is optimized for integral images (like box filters or Haar-like features) the filter size doesn't really matter, because the number of individual pixel values fetched from the integral image is independent of the filter size. – Micka Jun 30 '17 at 16:18
  • But how would you scale the filter up? If one of the filter's features focuses on a 2x1 square, for example, subtracting the lower pixel value from the upper one, and the filter is scaled 1.5 times larger - this means the 2x1 focus turns into 3x1, and you can't subtract the "lower" region from the "upper" anymore; you can't split the middle pixel. – Giora Simchoni Jun 30 '17 at 17:20
  • No idea about how to "interpolate" there. Maybe have a look at the OpenCV source code to see how they solved it. The "old style" Haar cascade classifiers afaik use the integral images, while the "new style" Haar cascade classifiers downscale the image instead. – Micka Jun 30 '17 at 17:26
  • Probably they'll scale either both regions (positive/negative) or none of them, ending up with a 4x2 or a 2x1 filter. But I never implemented it myself, so that's just a guess. – Micka Jun 30 '17 at 17:37
  • Or maybe they interpolate within the integral image to fetch subpixel positions, accessing e.g. pixel positions [0,0] and [1.5,0] by interpolating between positions [1,0] and [2,0], but not sure whether this is "allowed" (= makes sense) in integral images. – Micka Jun 30 '17 at 17:43
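Following up on the comment thread: here is a sketch of how I imagine "scaling the detector" could work while keeping the image fixed, in the spirit of Micka's guesses above - scale each feature rectangle's coordinates by the current factor, round to whole pixels, and divide each rectangle sum by its area so responses at different scales stay comparable. This reuses window_sum from the first sketch in the question; it is only a guess, not verified against the paper or the OpenCV source.

```python
def scaled_rect_sum(ii, rect, scale, oy, ox):
    """Mean pixel value of a detector-coordinate rect scaled onto the image.

    rect = (y, x, h, w) defined on the 24x24 detector grid; (oy, ox) is the
    top-left corner of the current detection window in the image.
    """
    y, x, h, w = rect
    ys = oy + int(round(y * scale))
    xs = ox + int(round(x * scale))
    hs = max(1, int(round(h * scale)))
    ws = max(1, int(round(w * scale)))
    # Still 4 integral-image lookups, whatever the rectangle size.
    return window_sum(ii, ys, xs, hs, ws) / (hs * ws)

def two_rect_feature(ii, top, bottom, scale, oy, ox):
    """Vertical 2-rectangle Haar feature: mean(top region) - mean(bottom region)."""
    return scaled_rect_sum(ii, top, scale, oy, ox) - scaled_rect_sum(ii, bottom, scale, oy, ox)

# The 2x1 example from the comments: a 1x1 "upper" region over a 1x1 "lower" one.
top_rect = (0, 0, 1, 1)      # (y, x, h, w) in detector coordinates
bottom_rect = (1, 0, 1, 1)
# value = two_rect_feature(ii, top_rect, bottom_rect, scale=1.5, oy=100, ox=100)
```

With this rounding, the 2x1 feature scaled by 1.5 ends up covering roughly a 4x2 area rather than 3x1, which matches one of the guesses in the comments, and the cost per rectangle stays constant regardless of scale.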
