I have read the original Viola-Jones article, the Wikipedia article, the OpenCV manual and these SO answers:
How does the Viola-Jones face detection method work?
Defining an (initial) set of Haar Like Features
I am trying to implement my own version of the two detectors from the original article (the AdaBoost 200-feature version and the final cascade version), but something is missing from my understanding, specifically how the 24 x 24 detector works on the entire image and (maybe) on its sub-images at (maybe) different scales. My current understanding of the detection loop is the following (a rough sketch in code is given below, after step 3):
(1) The integral image is computed twice for image variance normalization: once over the image as is, and once over its square:
The variance of an image sub-window can be computed quickly using a pair of integral images.
(2) The 24 x 24 detector is moved across the normalized image in steps of 1 pixel, deciding for each window whether it contains a face (or another object of interest) or not:
The final detector is scanned across the image at multiple scales and locations.
(3) Then the image is scaled down by a factor of 1.25, and we go back to (1).
This is done 12 times in total, until the shorter side of the image is about 24 pixels long (288 in the original image divided by 1.25^(12 - 1) is just over 24):
a 384 by 288 pixel image is scanned at 12 scales each a factor of 1.25 larger than the last.
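To make the steps above concrete, here is a minimal sketch of that understanding in plain NumPy. `classify_window` stands in for the boosted classifier (200-feature or cascade); its signature, and the nearest-neighbour downscaling, are my own assumptions, not anything from the paper:

```python
import numpy as np

def integral_images(img):
    """Summed-area tables of the image and of its square, padded with a
    leading zero row/column so any rectangle sum is four lookups."""
    f = img.astype(np.float64)
    ii = np.pad(f, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    ii_sq = np.pad(f ** 2, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return ii, ii_sq

def window_sum(ii, r, c, h, w):
    """Sum of the h-by-w window whose top-left pixel is (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def detect(img, classify_window, win=24, scale_step=1.25):
    """Image-pyramid version of detection as I currently understand it."""
    detections = []
    scale = 1.0
    current = img
    while min(current.shape) >= win:
        # step (1): the two integral images, used for variance normalization
        ii, ii_sq = integral_images(current)
        n = float(win * win)
        # step (2): slide the fixed 24 x 24 window in steps of 1 pixel
        for r in range(current.shape[0] - win + 1):
            for c in range(current.shape[1] - win + 1):
                s = window_sum(ii, r, c, win, win)
                sq = window_sum(ii_sq, r, c, win, win)
                mean = s / n
                var = sq / n - mean ** 2            # variance from the two integral images
                sigma = np.sqrt(max(var, 1e-12))    # used to normalize feature values
                if classify_window(ii, r, c, sigma):
                    # map the hit back to original-image coordinates
                    detections.append((int(r * scale), int(c * scale), int(round(win * scale))))
        # step (3): shrink the image by 1.25 and go back to step (1)
        scale *= scale_step
        rows = (np.arange(int(img.shape[0] / scale)) * scale).astype(int)
        cols = (np.arange(int(img.shape[1] / scale)) * scale).astype(int)
        current = img[np.ix_(rows, cols)]           # crude nearest-neighbour downscale
    return detections
```

With a 384 x 288 image this loop runs until the shorter side drops below 24, which matches the 12-scale figure in the quote above.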
But then I see this quote in the article:
Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good detection results were obtained using scales which are a factor of 1.25 apart.
And I see this quote in the Wikipedia article:
Instead of scaling the image itself (e.g. pyramid-filters), we scale the features.
Now I'm thrown off and no longer understand what exactly is going on. How can the 24 x 24 detector be scaled? The entire set of features computed on this area (whether in the 200-feature version or the ~6K-feature cascade version) was selected by originally exploring the 162K possible features of this 24 x 24 rectangle. And why did I come away with the impression that the pyramid paradigm still holds for this algorithm?
What should change in the algorithm above? Is it the image that is scaled or the 24 x 24 detector, and if it is the detector, how exactly is that done?
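Is the intended meaning something like the following sketch, where the feature rectangles (not the image) are multiplied by the scale factor and then evaluated on the full-resolution integral image? `Rect`, `scale_rects` and the division by the scaled window area and by sigma are my own invention, just to make the question concrete:

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: int          # offsets and sizes defined on the 24 x 24 base window
    y: int
    w: int
    h: int
    weight: float   # +1 / -1 / -2 etc. for the light and dark parts of the feature

def scale_rects(rects, s):
    """Scale the rectangles of one Haar-like feature by the factor s, so a
    feature learned on 24 x 24 is evaluated on a 30 x 30 window for s = 1.25."""
    return [Rect(int(round(r.x * s)), int(round(r.y * s)),
                 int(round(r.w * s)), int(round(r.h * s)), r.weight)
            for r in rects]

def feature_value(ii, top, left, scaled_rects, s, sigma):
    """Evaluate a scaled feature on the integral image ii at window (top, left).
    Normalizing by the scaled window area and by sigma is only my guess at how
    the thresholds learned on 24 x 24 could stay valid."""
    area = (24.0 * s) ** 2
    total = 0.0
    for r in scaled_rects:
        y0, x0 = top + r.y, left + r.x
        total += r.weight * (ii[y0 + r.h, x0 + r.w] - ii[y0, x0 + r.w]
                             - ii[y0 + r.h, x0] + ii[y0, x0])
    return total / (area * sigma)
```

If that is the idea, how are the learned weak-classifier thresholds kept valid after the rectangles are scaled and rounded to integer coordinates?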