I have this function, which Intel Advisor strongly suggests vectorizing:

void SIFTDescriptor::samplePatch(float *vec)
{
   for (int r = 0; r < par.patchSize; ++r)
   {
      const int br0 = par.spatialBins * bin0[r]; const float wr0 = w0[r];
      const int br1 = par.spatialBins * bin1[r]; const float wr1 = w1[r];
      for (int c = 0; c < par.patchSize; ++c)
      {
         const float val = mask.at<float>(r,c) * grad.at<float>(r,c);

         const int bc0 = bin0[c];
         const float wc0 = w0[c]*val;
         const int bc1 = bin1[c];
         const float wc1 = w1[c]*val;

         // ori from atan2 is in range <-pi,pi> so add 2*pi to be surely above zero
         const float o = float(par.orientationBins)*(ori.at<float>(r,c) + 2*M_PI)/(2*M_PI);

         int   bo0 = (int)o;
         const float wo1 =  o - bo0;
         bo0 %= par.orientationBins;

         int   bo1 = (bo0+1) % par.orientationBins;
         const float wo0 = 1.0f - wo1;

         // add to corresponding 8 vec...
         if (wr0*wc0>0) {
             vec[br0+bc0+bo0] += wr0*wc0 * wo0;
             vec[br0+bc0+bo1] += wr0*wc0 * wo1;
         }
         if (wr0*wc1>0) {
             vec[br0+bc1+bo0] += wr0*wc1 * wo0;
             vec[br0+bc1+bo1] += wr0*wc1 * wo1;
         }
         if (wr1*wc0>0) {
             vec[br1+bc0+bo0] += wr1*wc0 * wo0;
             vec[br1+bc0+bo1] += wr1*wc0 * wo1;
         }
         if (wr1*wc1>0) {
             vec[br1+bc1+bo0] += wr1*wc1 * wo0;
             vec[br1+bc1+bo1] += wr1*wc1 * wo1;
         }
      }
   }
}

I'm using the Intel compiler with the following options:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info -ldl

However, Intel Advisor tells me that there are two Read-After-Write dependencies in:

 vec[br0+bc0+bo0] += wr0*wc0 * wo0;

And:

 vec[br1+bc0+bo0] += wr1*wc0 * wo0;
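
To illustrate why an update of this shape is reported as a read-after-write dependency, here is a minimal sketch. The names `scatterAdd`, `hist`, `idx` and `w` are hypothetical and not part of the function above; the point is only that whenever two iterations compute the same index, the later one has to read the value the earlier one wrote, so the iterations cannot simply be executed side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from SIFTDescriptor): adds w[i] into hist[idx[i]].
// If idx[i] == idx[j] for some i < j, iteration j reads what iteration i
// wrote. That is the read-after-write dependency between loop iterations.
void scatterAdd(std::vector<float>& hist,
                const std::vector<int>& idx,
                const std::vector<float>& w)
{
    assert(idx.size() == w.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        hist[idx[i]] += w[i];   // index is only known at run time
}
```

In `samplePatch` the index depends on `bo0`, which is computed from `ori` at run time, so the compiler cannot rule out such collisions.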

Now, I'm very much a beginner with SIMD, and from my understanding I have to write SSE/AVX2/AVX-512 instructions to solve this dependency. For example, I found this question, which explains how to store cumulative sums in array cells. My case is somewhat different, since I want to accumulate the results into array elements (vec[something]) and not into a scalar variable like result.

However, the answer to the second question explains that the data must be aligned in order to use that code. Since vec points into the data of a cv::Mat object, I don't really think that the data is aligned.

In this answer, someone questioned whether aligned data is necessary for my problem. In other words, I'm afraid that I'm stuck in an XY problem, focusing on aligning data when (maybe) it's not actually needed (especially since I'm a SIMD beginner and I'm afraid of overthinking).

Note: I'm using an AVX2-compatible machine and I plan to move to an AVX-512 machine later.

  • It looks like you have a pretty hard cross-iteration dependency in the inner c-loop. This is a fundamental characteristic of the algorithm (especially because bo0 is a variable and unpredictable function of c), so aligning the data will not resolve the underlying issue. A plain SIMD reduction will not work either. As a first simple attempt, I would try vectorizing across the outer r-loop: put #pragma simd before the first for-loop statement (the r-loop). Formally, before doing that you need to check the Advisor dependencies for the r-loop (not for the c-loop). – zam Apr 26 '17 at 16:56
  • I know that the vector size is 128. So what if I create 8 different vectors (one for each accumulation, vec[...] +=) and sum them at the end? This would solve the dependency, right? Note that both for loops run over a range of 41 (so 1681 iterations). – cplusplusuberalles Apr 26 '17 at 17:55
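
One way to read the idea in the last comment is as privatization of the accumulator: keep several private copies of the 128-element vector and sum them at the end. The sketch below is a simplified illustration under stated assumptions, not a vectorized version of `samplePatch`. It keeps one private copy per SIMD lane (8 floats per AVX2 register) rather than one per `+=` statement, and the names `scatterAddPrivatized`, `kLanes`, `kBins`, `idx` and `w` are hypothetical. Whether a given compiler actually vectorizes such a loop is not guaranteed, and the changed summation order can alter the floating-point result slightly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;    // assumption: 8 floats per AVX2 register
constexpr std::size_t kBins  = 128;  // vector size mentioned in the comment

// Hypothetical helper: iteration i only touches private copy i % kLanes, so
// any kLanes consecutive iterations update disjoint memory, even when they
// compute the same bin index. Assumes 0 <= idx[i] < kBins.
void scatterAddPrivatized(float* vec,
                          const std::vector<int>& idx,
                          const std::vector<float>& w)
{
    assert(idx.size() == w.size());
    std::array<std::array<float, kBins>, kLanes> priv{};  // zero-initialized

    for (std::size_t i = 0; i < idx.size(); ++i)
        priv[i % kLanes][idx[i]] += w[i];

    // Final reduction: fold the private copies back into vec.
    for (std::size_t b = 0; b < kBins; ++b) {
        float sum = 0.0f;
        for (std::size_t l = 0; l < kLanes; ++l)
            sum += priv[l][b];
        vec[b] += sum;
    }
}
```

The cost is extra memory (8 x 128 floats here) and a reduction pass, which is small next to the 1681 iterations of the two loops.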
