I've been looking through the literature on feature pyramids for recognition (in my case, pedestrians in a 1280 x 960 camera image), and most texts seem to recommend halving the image size at each step. However, I came across this code, which uses:
#define LAMBDA 10
#define SIDE_LENGTH 8

step = powf(2.0f, 1.0f / (float)LAMBDA);

maxNumCells = W / SIDE_LENGTH;
if (maxNumCells > H / SIDE_LENGTH)
{
    maxNumCells = H / SIDE_LENGTH;
}

numStep = (int)(logf((float)maxNumCells / 5.0f) / logf(step)) + 1;
This gives 46 steps in my image pyramid, right down to a 54 x 40 image, and takes about 0.7 seconds per frame. Bringing LAMBDA down to 2 gives me 10 steps and near-real-time output, with detection that is just as good, if not better, since the faster processing means I drop fewer intermediate frames.
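For what it's worth, here is a small standalone sketch of the same calculation for my 1280 x 960 input, assuming level i is simply the image downscaled by step^i. It reproduces the 46- and 10-step counts; the smallest-level size it prints is only approximate, since the real resize code rounds differently.

#include <math.h>
#include <stdio.h>

#define SIDE_LENGTH 8

/* Reproduce the step-count formula from the snippet above and report
 * the rough size of the smallest pyramid level, assuming level i is
 * the input downscaled by step^i. The exact smallest-level dimensions
 * depend on how the actual resize code rounds, so they are approximate. */
static void pyramid_info(int lambda, int W, int H)
{
    float step = powf(2.0f, 1.0f / (float)lambda);

    int maxNumCells = W / SIDE_LENGTH;
    if (maxNumCells > H / SIDE_LENGTH)
        maxNumCells = H / SIDE_LENGTH;

    int numStep = (int)(logf((float)maxNumCells / 5.0f) / logf(step)) + 1;

    /* Scale factor of the last (smallest) level. */
    float topScale = powf(step, (float)(numStep - 1));

    printf("LAMBDA = %2d: %2d steps, smallest level roughly %d x %d\n",
           lambda, numStep, (int)(W / topScale), (int)(H / topScale));
}

int main(void)
{
    pyramid_info(10, 1280, 960); /* my current setting: 46 steps */
    pyramid_info(2, 1280, 960);  /* the faster setting: 10 steps */
    return 0;
}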
So, what am I trading off by flattening my pyramid like this? Is there a rule of thumb for estimating the number of steps needed? How small should the top of the pyramid be?
EDIT: The algorithm I am working with is described in "Object Detection with Discriminatively Trained Part-Based Models" (Felzenszwalb et al.).
(I'm also not so confident about the code quality: since step = 2^(1/LAMBDA), logf(step) is just logf(2) / LAMBDA, so the division by logf(step) amounts to multiplying by LAMBDA and taking the numerator's log base 2 instead. But that's another matter altogether.)
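A sketch of the equivalent one-liner, assuming C99's log2f from math.h:

numStep = (int)(log2f((float)maxNumCells / 5.0f) * (float)LAMBDA) + 1;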