
When it comes to cascade classifiers (using Haar-like features), I always read that methods like AdaBoost are used to select the 'best' features for detection. However, this only works if there is some initial set of features to begin boosting with.

Given a 24x24 pixel image, there are 162,336 possible Haar features. I might be wrong here, but I don't think libraries like OpenCV initially test against all of these features.

So my question is: how are the initial features selected, or how are they generated? Is there any guideline for the initial number of features?

And if all 162,336 features are used initially, how are they generated?

  • I remember asking myself the same question a few years ago. I even wanted to try training all of the possibilities and see if I could get better results. Unfortunately, the cascade trainer implementation I had couldn't handle such a large number in memory and would have required a new dedicated implementation. I eventually didn't have time to check it, but I would be interested to know if someone figured this out. – Photon Dec 07 '14 at 09:10
  • This discussion is an example of how SO can be misleading for people who are learning a subject and who upvote answers based on their initial beliefs. It is all based on a statement which is false: "However this only works if there is some initial set of features to begin boosting". In fact, the original formulation of Viola-Jones and the ones that followed don't care about it, because it really is not a problem. As I said in my answer, the real bottleneck is the process of gathering negative samples in the last stages of the cascade. This can cause training to last for many days. – Giuseppe Dini Dec 21 '15 at 09:44
  • If someone doesn't agree, I invite him/her to show me the part of the OpenCV source code (which I know well, as I often modify it) or of the Viola-Jones and Lienhart-Maydt papers where it is stated otherwise. – Giuseppe Dini Dec 21 '15 at 21:56

3 Answers


From your question I understand that you want to know how the 162,336 features are generated.

Starting from the 4 original Viola-Jones feature types (http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework), we can generate 162,336 features by varying the size of each feature and its position on the 24x24 input image.

For example, consider one of the original features, which has two rectangles adjacent to each other horizontally. Let the size of each rectangle be 1 pixel, so the whole feature spans 2x1 pixels. If the feature starts at (0,0) of the 24x24 image, that is one feature; if you move it horizontally by one pixel to (1,0), it counts as a second feature, since its position has changed. You can move it horizontally up to (22,0), generating 23 features. Similarly, moving along the vertical axis from (0,0) to (0,23) generates 24 positions. Covering every position on the image (for example (1,1), (1,2), ..., (22,23)) therefore generates 24*23 = 552 features.

Now consider each rectangle with a width of 2 pixels and a height of 1 pixel, so the feature spans 4x1 pixels. Starting at (0,0) and moving along the horizontal axis up to (20,0) as above gives 21 features; since the height is unchanged, moving along the vertical axis from (0,0) to (0,23) gives 24 positions. Covering every position on the image thus gives 24*21 = 504 features.

In this way, if we increase the width of each rectangle by one pixel (keeping the height at 1 pixel) and sweep the complete image each time, so that the rectangle width goes from 1 to 12 pixels (i.e. the feature width goes from 2 to 24 pixels), we get: no. of features = 24*(23+21+19+...+3+1) = 24*144 = 3,456.

Now let each rectangle have a width of 1 pixel and a height of 2 pixels, so the feature spans 2x2 pixels. Starting at (0,0) and moving along the horizontal axis up to (22,0) gives 23 features; since the height is 2 pixels, moving along the vertical axis from (0,0) to (0,22) gives 23 positions. Covering every position on the image thus gives 23*23 = 529 features.

Similarly, if we increase the width of each rectangle by one pixel (keeping the height at 2 pixels) and sweep the complete image each time, so that the rectangle width again goes from 1 to 12 pixels, we get: no. of features = 23*(23+21+19+...+3+1) = 23*144 = 3,312.

Now, if we keep increasing the height of each rectangle by 1 pixel (sweeping all widths each time) until the height reaches 24 pixels, then

no. of features = 24*(23+21+19+...+3+1) + 23*(23+21+19+...+3+1) + 22*(23+21+19+...+3+1) + ... + 2*(23+21+19+...+3+1) + 1*(23+21+19+...+3+1)

            = (24+23+...+1) * (23+21+...+1) = 300 * 144

            = 43,200 features

Now consider the 2nd Viola-Jones original feature, which has two rectangles stacked vertically (one above the other). Since this is just the 1st feature rotated by 90 degrees, it likewise gives

no. of features = 43,200

Similarly, following the above process for the 3rd original Viola-Jones feature, which has 3 rectangles arranged horizontally, we get

no. of features = 24*(22+19+16+...+4+1) + 23*(22+19+16+...+4+1) + 22*(22+19+16+...+4+1) + ... + 2*(22+19+16+...+4+1) + 1*(22+19+16+...+4+1)

            = (24+23+...+1) * (22+19+...+1) = 300 * 92

            = 27,600

Now, if we consider the feature which has 3 rectangles stacked vertically (one upon another), then by the same symmetry we get

no. of features = 27,600

Lastly, if we consider the 4th original Viola-Jones feature, which has 4 rectangles in a 2x2 arrangement, we get

no. of features = 23*(23+21+19+...+3+1) + 21*(23+21+19+...+3+1) + 19*(23+21+19+...+3+1) + ... + 3*(23+21+19+...+3+1) + 1*(23+21+19+...+3+1)

            = (23+21+...+1) * (23+21+...+1) = 144 * 144

            = 20,736

Now, summing up all these features, we get 43,200 + 43,200 + 27,600 + 27,600 + 20,736

            = 162,336 features

Thus, from these 162,336 features, AdaBoost selects some to form a strong classifier.
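
If you want to double-check these counts, here is a small brute-force sketch (my own illustration, not code from OpenCV) that enumerates every scale and position of the five feature types in a 24x24 window:

    #include <cstdio>

    // Count all placements of one Haar feature type inside a W x H window.
    // (w, h) is the smallest size of the whole feature, e.g. 2x1 for the
    // two-rectangle horizontal feature; it is scaled in integer multiples.
    long countFeatures(int W, int H, int w, int h) {
        long count = 0;
        for (int fw = w; fw <= W; fw += w)      // scaled feature widths
            for (int fh = h; fh <= H; fh += h)  // scaled feature heights
                count += (long)(W - fw + 1) * (H - fh + 1); // top-left positions
        return count;
    }

    int main() {
        const int W = 24, H = 24;
        long twoH   = countFeatures(W, H, 2, 1); // two rects side by side: 43,200
        long twoV   = countFeatures(W, H, 1, 2); // two rects stacked:      43,200
        long threeH = countFeatures(W, H, 3, 1); // three rects in a row:   27,600
        long threeV = countFeatures(W, H, 1, 3); // three rects stacked:    27,600
        long four   = countFeatures(W, H, 2, 2); // 2x2 checkerboard:       20,736
        printf("total = %ld\n", twoH + twoV + threeH + threeV + four); // 162,336
        return 0;
    }

Each feature type is described by its smallest size and scaled in integer multiples of it, exactly as in the derivation above.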

Hemanth
  • My problem is that, given a large enough data set, testing a single original feature at all positions will take a lot of time. Additionally, there are many original features one could try, not just the 4 that Viola and Jones mention. So I was wondering how to select these original features, or how an algorithm does that (if there is one). After that, one can use AdaBoost to further reduce the number of features, but how do you get that initial set? –  Jan 06 '15 at 08:31

I presume you're familiar with Viola/Jones' original work on this topic.

You start by manually choosing a feature type (e.g. rectangle A). This gives you a mask with which you can train your weak classifiers. To avoid moving the mask pixel by pixel and retraining (which would take a huge amount of time without yielding better accuracy), you can specify how far the feature moves in the x and y directions per trained weak classifier. The size of these jumps depends on your data size. The goal is for the mask to be able to move in and out of the detected object. The size of the feature can also be variable.

After you've trained multiple classifiers, each with its respective feature (i.e. mask position), you proceed with AdaBoost and cascade training as usual.

The number of features/weak classifiers is highly dependent on your data and experimental setup (i.e. also on the type of classifier you use). You'll need to test the parameters extensively to learn which types of features work best (rectangles, circles, Tetris-like objects, etc.). I worked on this 2 years ago, and it took us quite a long time to evaluate which features and feature-generation heuristics yielded the best results.

If you want to start somewhere, just take 1 of the 4 original Viola/Jones features and train a classifier with it anchored to (0,0). Train the next classifier at (x,0), the next at (2x,0), ..., then (0,y), (0,2y), (0,4y), ..., (x,y), (x,2y), etc., and see what happens. Most likely you'll see that it's OK to have fewer weak classifiers, i.e. you can increase the x/y step values that determine how the mask slides (see the sketch below). You can also have the mask grow, or do other things to save time. The reason this "lazy" feature generation works is AdaBoost: as long as these features make the classifiers slightly better than random, AdaBoost will combine them into a meaningful classifier.
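
To illustrate, here is a rough sketch of such "lazy" feature-pool generation (the struct and parameter names are made up for this example; this is not any library's API):

    #include <vector>

    // Hypothetical descriptor for one candidate weak-classifier mask.
    struct Mask { int x, y, w, h; };

    // Generate a reduced pool of mask placements using step sizes instead
    // of sliding pixel by pixel: stepX/stepY control how far the mask jumps,
    // stepScale controls how fast it grows. Larger steps -> smaller pool.
    std::vector<Mask> generatePool(int W, int H, int baseW, int baseH,
                                   int stepX, int stepY, int stepScale) {
        std::vector<Mask> pool;
        for (int h = baseH; h <= H; h += baseH * stepScale)
            for (int w = baseW; w <= W; w += baseW * stepScale)
                for (int y = 0; y + h <= H; y += stepY)
                    for (int x = 0; x + w <= W; x += stepX)
                        pool.push_back({x, y, w, h});
        return pool;
    }

Each Mask in the pool would train one weak classifier; increasing stepX/stepY/stepScale shrinks the pool, and AdaBoost then decides which of the resulting classifiers are actually worth keeping.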

runDOSrun
  • So the initial set of features is always something handcrafted, and there is no rule like: "Choose at least a (complete) basis of wavelets"? –  Dec 07 '14 at 15:05
  • Your feature pool is hand-crafted trial & error, with feedback from AdaBoost. What I did for text detection (6 years ago) was to constrain the geometry and e.g. filter odd coordinates. From this pool I used all features in all stages (the integral image makes this cheap). Have a look at Chapter 6.1 in my [Master's thesis about text detection](http://log2.ch/misc/detecting_and_reading_text_in_natural_scenes.pdf) (pdf) for details. – maxy Dec 07 '14 at 16:09
  • Yes, it's like maxy said. There are no heuristics that can be applied to every problem and data set when choosing the features. – runDOSrun Dec 07 '14 at 16:30

It seems to me that there is a bit of confusion here.
Even the accepted answer does not seem correct to me (though maybe I haven't understood it well). The original Viola-Jones algorithm, its main later improvements such as the Lienhart-Maydt algorithm, and the OpenCV implementation all evaluate each and every feature of the feature set in turn.
You can check the source code of OpenCV (or whatever implementation you prefer).
At the end of the function void CvHaarEvaluator::generateFeatures() you have numFeatures, which is just 162,336 for BASIC mode and a 24x24 window size.
All of them are checked in turn when the whole feature set is provided in the form of featureEvaluator (source):

bool isStageTrained = tempStage->train( (CvFeatureEvaluator*)featureEvaluator, curNumSamples,
  _precalcValBufSize, _precalcIdxBufSize, *((CvCascadeBoostParams*)stageParams) );

Every weak classifier is constructed by checking each feature and choosing the one that yields the best result at that point (in the case of a decision tree the process is similar).
After this choice, the sample weights are updated accordingly, so that in the next round a different feature, again from the whole feature set, will be selected. A single feature evaluation is computationally cheap, but multiplied by numFeatures it can be demanding.
The whole training of a cascade can take weeks, but the bottleneck is not the feature evaluation process; it is the gathering of negative samples in the later stages.
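
For concreteness, one boosting round looks roughly like this (a heavily simplified sketch of mine, not the OpenCV code: evalFeature and label are hypothetical helpers, and a fixed threshold is assumed, whereas the real trainer also optimises a threshold and parity per feature):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct RoundResult { int bestFeature; double alpha; };

    // One round of discrete AdaBoost: scan the WHOLE feature set, keep the
    // feature with the lowest weighted error, then re-weight the samples.
    RoundResult boostOneRound(int numFeatures, int numSamples,
                              std::vector<double>& weights, // one per sample
                              double (*evalFeature)(int f, int i),
                              int (*label)(int i),          // +1 or -1
                              double threshold) {
        int best = -1;
        double bestErr = 1.0;
        for (int f = 0; f < numFeatures; ++f) {   // every feature, in turn
            double err = 0.0;
            for (int i = 0; i < numSamples; ++i) {
                int pred = evalFeature(f, i) > threshold ? +1 : -1;
                if (pred != label(i)) err += weights[i];
            }
            if (err < bestErr) { bestErr = err; best = f; }
        }
        bestErr = std::max(bestErr, 1e-10);       // avoid division by zero
        double alpha = 0.5 * std::log((1.0 - bestErr) / bestErr);
        // Re-weight samples so the next round favours a different feature.
        double sum = 0.0;
        for (int i = 0; i < numSamples; ++i) {
            int pred = evalFeature(best, i) > threshold ? +1 : -1;
            weights[i] *= std::exp(-alpha * pred * label(i));
            sum += weights[i];
        }
        for (double& w : weights) w /= sum;       // normalise
        return {best, alpha};
    }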
From the wikipedia link you provided I read:

in a standard 24x24 pixel sub-window, there are a total of M= 162,336 possible features, and it would be prohibitively expensive to evaluate them all when testing an image.

Don't be misled by this: it means that after the long training process your detection algorithm will be very fast, because it only needs to check a few features (just the ones selected during training).
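
At detection time each stage evaluates only its few selected features, and most windows are rejected by the very first stages. Roughly (again a sketch of mine, not the actual OpenCV code):

    #include <vector>

    struct Weak  { int feature; double alpha; };  // one selected feature
    struct Stage { std::vector<Weak> weak; double stageThreshold; };

    // evalFeature(f) is assumed to return the value of feature f on the
    // current window (cheap thanks to the integral image).
    bool classifyWindow(const std::vector<Stage>& cascade,
                        double (*evalFeature)(int f)) {
        for (const Stage& s : cascade) {
            double score = 0.0;
            for (const Weak& w : s.weak)
                score += w.alpha * (evalFeature(w.feature) > 0 ? 1.0 : -1.0);
            if (score < s.stageThreshold) return false; // early rejection
        }
        return true; // survived all stages: object detected
    }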

Giuseppe Dini
  • Yes, every single feature in the feature set is considered by Viola-Jones and (via AdaBoost) selected to construct a cascade classifier. My question asks about the very first sample set of these features. This is because the theoretically possible set of features is a fully blown Haar-wavelet basis plus every possible linear combination of these (as features may be redundant). Constructing this is nontrivial, and searching in this space is as well. So I wanted to know how it is done in practice. –  Dec 20 '15 at 11:09
  • The words "first sample set of these features" don't make sense. The theoretically possible set of features is not a combination of features, but just the features themselves, one by one. In the training phase no combination of features is considered; only the final result is a combination of features. The strength of AdaBoost is precisely that it allows you to explore the feature space by considering isolated features. – Giuseppe Dini Dec 20 '15 at 11:47
  • It works like this: you search for the single feature that yields the best result, checking them one by one (not in combination). Once you find it (and calculate a coefficient for the weak classifier associated with it), you retain it. Then you modify the weights of the samples and repeat the process. At the end you will get a set of features, with their coefficients, that constitutes your classifier (in fact just one stage of Viola-Jones). There is no initial set of features. Anywhere. – Giuseppe Dini Dec 20 '15 at 11:53