
There have been several attempts to speed up HOG descriptor calculation with SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitudes to the HOG histogram:

float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size]; 

for(size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}

Here the value of size depends on the implementation, but the general idea is the same.

I know that histogram calculation with SIMD has no simple and efficient solution in general. But in this case the histogram is small (18 bins). Can that help with SIMD optimization?

ErmIg

2 Answers


I have found a solution: a temporary buffer. First we accumulate the histogram into a temporary buffer (this operation can be vectorized). Then we add the sums from the buffer to the output histogram (this operation can also be vectorized):

float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size]; 
float buf[18][4] = {0}; // temporary buffer, must start at zero

for(size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}

for(size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i] += buf[i][0];
    histogram[y/8][x/8 + 1][i] += buf[i][1];
    histogram[y/8 + 1][x/8][i] += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
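
The first loop vectorizes because the four values in buf[idx[i]] are contiguous, so each pixel becomes one 4-wide multiply followed by one 4-wide add. A minimal sketch of that idea, assuming SSE is available (a scalar fallback is compiled otherwise). The names accumulate_cell and flush_cell and the packed weight array w are hypothetical, chosen for illustration, and are not taken from OpenCV, Dlib, or Simd:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HOG_USE_SSE 1
#endif

constexpr int kBins = 18;  // histogram size from the question

// Hypothetical helper: accumulates n pixels of one cell into buf[kBins][4].
// Assumption: w holds 4 precomputed weights per pixel, in the order
// {ky[y]*kx[x], ky[y]*kx[x+1], ky[y+1]*kx[x], ky[y+1]*kx[x+1]}.
void accumulate_cell(const int* idx, const float* val, const float* w,
                     std::size_t n, float buf[kBins][4])
{
    for (std::size_t i = 0; i < n; ++i)
    {
        assert(idx[i] >= 0 && idx[i] < kBins);
        float* dst = buf[idx[i]];  // 4 contiguous floats for this bin
#ifdef HOG_USE_SSE
        // Broadcast the magnitude, multiply by the 4 weights, add to the bin.
        __m128 inc = _mm_mul_ps(_mm_set1_ps(val[i]), _mm_loadu_ps(w + 4 * i));
        _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(dst), inc));
#else
        for (int k = 0; k < 4; ++k)
            dst[k] += val[i] * w[4 * i + k];
#endif
    }
}

// Hypothetical helper: lane k of every bin goes to the k-th neighbouring
// histogram. Kept scalar here for brevity.
void flush_cell(const float buf[kBins][4],
                float* h00, float* h01, float* h10, float* h11)
{
    for (int b = 0; b < kBins; ++b)
    {
        h00[b] += buf[b][0];
        h01[b] += buf[b][1];
        h10[b] += buf[b][2];
        h11[b] += buf[b][3];
    }
}
```

Here h00, h01, h10 and h11 stand for &histogram[y/8][x/8][0] and its three neighbours. The buffer must be zeroed before each cell, and the flush costs only 18 x 4 additions per cell regardless of how many pixels were accumulated.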
ErmIg

You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments. Then process these in a scalar loop afterwards. You probably also want to strip-mine this such that you process one row at a time, in order to keep the temporary bin indices and increments in cache. It might appear that this would be inefficient, due to the use of temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.

uint32_t i = 0;

for (y = 0; y < height; ++y)   // for each row
{
    uint32_t indices[width * 4];  // flattened histogram indices for this row
    float vals[width * 4];     // histogram bin increments for this row

    // SIMD loop for this row - calculate flattened histogram indices and bin
    // increments (scalar code shown for reference - converting this loop to
    // SIMD is left as an exercise for the reader...)

    for (x = 0; x < width; ++x, ++i)
    {
        indices[4*x]   = (y/8)*(width/8)*18+(x/8)*18+idx[i];
        indices[4*x+1] = (y/8)*(width/8)*18+(x/8 + 1)*18+idx[i];
        indices[4*x+2] = (y/8+1)*(width/8)*18+(x/8)*18+idx[i];
        indices[4*x+3] = (y/8+1)*(width/8)*18+(x/8 + 1)*18+idx[i];

        vals[4*x]   = val[i]*ky[y]*kx[x];
        vals[4*x+1] = val[i]*ky[y]*kx[x+1];
        vals[4*x+2] = val[i]*ky[y+1]*kx[x];
        vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
    }

    // scalar loop for this row

    float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram

    for (x = 0; x < width * 4; ++x) // for each index/increment pair in this row
    {
        histogram_base[indices[x]] += vals[x];  // update the (flattened) histogram
    }

}
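
As a rough illustration of the SIMD part, the bin increments for one row could be computed four pixels at a time as sketched below. This assumes SSE (a scalar fallback is compiled otherwise), that width is a multiple of 4, and that kx has at least width + 1 elements. The function name row_increments and its parameter layout are hypothetical; ky0 and ky1 stand for ky[y] and ky[y+1]:

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HOG_USE_SSE 1
#endif

// Hypothetical helper: fills vals[4*x .. 4*x+3] for every pixel x of a row.
// val points at the first magnitude of the row.
void row_increments(const float* val, const float* kx, float ky0, float ky1,
                    std::size_t width, float* vals)
{
    for (std::size_t x = 0; x < width; x += 4)
    {
#ifdef HOG_USE_SSE
        const __m128 v   = _mm_loadu_ps(val + x);     // 4 magnitudes
        const __m128 kx0 = _mm_loadu_ps(kx + x);      // kx[x]   for 4 pixels
        const __m128 kx1 = _mm_loadu_ps(kx + x + 1);  // kx[x+1] for 4 pixels
        const __m128 a = _mm_mul_ps(v, _mm_set1_ps(ky0));
        const __m128 b = _mm_mul_ps(v, _mm_set1_ps(ky1));
        __m128 r0 = _mm_mul_ps(a, kx0);  // val*ky[y]*kx[x]
        __m128 r1 = _mm_mul_ps(a, kx1);  // val*ky[y]*kx[x+1]
        __m128 r2 = _mm_mul_ps(b, kx0);  // val*ky[y+1]*kx[x]
        __m128 r3 = _mm_mul_ps(b, kx1);  // val*ky[y+1]*kx[x+1]
        // After the transpose, r0..r3 hold the 4 increments of pixels x..x+3.
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
        _mm_storeu_ps(vals + 4 * x,      r0);
        _mm_storeu_ps(vals + 4 * x + 4,  r1);
        _mm_storeu_ps(vals + 4 * x + 8,  r2);
        _mm_storeu_ps(vals + 4 * x + 12, r3);
#else
        for (std::size_t k = x; k < x + 4; ++k)
        {
            vals[4 * k]     = val[k] * ky0 * kx[k];
            vals[4 * k + 1] = val[k] * ky0 * kx[k + 1];
            vals[4 * k + 2] = val[k] * ky1 * kx[k];
            vals[4 * k + 3] = val[k] * ky1 * kx[k + 1];
        }
#endif
    }
}
```

The transpose is what produces the interleaved layout that the scalar update loop expects. The index calculation could be vectorized along the same lines with integer instructions, but it is not shown here.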
Paul R
  • Thank you. Dlib has similar optimizations, but in the end it uses scalar code to add the values to the histograms, so your solution doesn't fundamentally differ from it. – ErmIg Apr 10 '17 at 08:47
  • Oh, OK - I'm not familiar with Dlib. I'll leave this answer here in case it's useful to anyone else looking for histogramming optimisation ideas in the future. – Paul R Apr 10 '17 at 08:49
  • It's partially my fault, because I didn't state all the conditions in my question. Thank you for the good answer! – ErmIg Apr 10 '17 at 08:54