
I've been looking around the literature regarding feature pyramids for recognition (in my case pedestrians from a 1280 x 960 camera image) and most texts seem to recommend halving the size each step. However, I came across this code which uses:

#define LAMBDA 10
#define SIDE_LENGTH 8

/* scale factor between consecutive pyramid levels: 2^(1/LAMBDA),
   i.e. LAMBDA levels per octave (per halving of resolution) */
step = powf(2.0f, 1.0f / ((float)LAMBDA));

/* number of cells along the smaller image dimension */
maxNumCells = W / SIDE_LENGTH;
if( maxNumCells > H / SIDE_LENGTH )
{
    maxNumCells = H / SIDE_LENGTH;
}

/* number of levels until the smaller dimension shrinks to 5 cells */
numStep = (int)(logf((float) maxNumCells / (5.0f)) / logf( step )) + 1;

This gives 46 steps on my image pyramid, right down to a 54 x 40 image, and takes about 0.7 seconds per frame! Bringing the LAMBDA value down to 2 gives me 10 steps and near-real-time output, with detection just as good, if not better, as I drop fewer intermediate frames.

So, what am I trading off by lowering my pyramid? Is there a rule of thumb for estimating the steps needed? How small should the top of the pyramid be?

EDIT: The algorithm I am working with is described in Object Detection with Discriminatively Trained Part Based Models.

(I'm also not so confident about the code quality, as logf(...) / logf( step ) can obviously be replaced with LAMBDA * log2f(...), but that's another matter altogether)

Ken Y-N

1 Answer


First of all, I am no CV/DIP expert, so handle this with extreme prejudice...

BTW +1 for a very interesting question; I am looking forward to seeing the other answers here. The first thing that crossed my mind looking at your question is why the code uses:

  • logf(powf(2.0f, 1.0f / ((float)LAMBDA)))

Such things seen in code make one wonder how good the rest of it is, but it could be a leftover from some previous computation method that was later changed to its current state. The same goes for the /logf(step) you spotted.

Anyway, the answer to your question is: use as many steps as works best for your task.

Without knowing the background of what you are doing and how, and lacking experience with your images and the actual implementation, it is hard to give a better answer.

In general, too many steps can foul detection/classification with feature mismatches, because many features are lost at lower resolutions; statistically based algorithms can be thrown off by this...

On the other hand, too few steps can miss some scaling issues, and many similar features can be mismatched because they are never eliminated by the lowered resolutions.

  1. You could try adaptive techniques based on some knowledge of the input.

    For example, if you know the average complexity of your images, then you know approximately how many feature points of interest there should be, so add steps to your pyramid until you hit the expected number...

  2. There are also static techniques based on resolution and the minimum accepted detail size.

    The number of steps is then a function of the image area and that minimum detail size...
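The adaptive idea in option #1 might be sketched like this; everything here, including the toy feature counter, is hypothetical and only illustrates the loop structure:

```c
/* Placeholder for whatever interest-point detector is actually in use. */
typedef int (*feature_counter)(int w, int h);

/* Add levels (each 'step' times smaller) until the cumulative feature
   count reaches what the expected scene complexity predicts. */
static int adaptive_num_layers(int w, int h, float step,
                               int expected, feature_counter count)
{
    int layers = 0, total = 0;
    float fw = (float)w, fh = (float)h;
    while (total < expected && fw >= 1.0f && fh >= 1.0f) {
        total += count((int)fw, (int)fh);
        fw /= step;   /* shrink both dimensions by the scale factor */
        fh /= step;
        ++layers;
    }
    return layers;
}

/* Toy counter for demonstration: pretends every level yields 5 features. */
static int toy_counter(int w, int h) { (void)w; (void)h; return 5; }
```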

Both approaches require some research and testing on sample data to find the optimal constants, and then forming an equation for the optimal number of pyramid layers (or feature points, or whatever) based on the real input data.

I think option #2 is the case in the code you provided.

So the /5.0f could be just a scaling to a different logarithm base, or an empirical constant that gave optimal results for the author's task. It could also be the radius for feature extraction, or some kind of minimum detail size... It is hard to tell from uncommented source code; it reminds me of this (not really constructive) link.

Spektre
  • BTW, dividing by five means `5 * SIDE_LENGTH` is the minimum dimension of the image at the top of the pyramid, which seems to be far too small. – Ken Y-N Nov 11 '15 at 05:16