Is it possible to vectorize this nested for with SSE?

Question

I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:

   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      for(int i=-halfWidth; i<=halfWidth; ++i)
      {
         const float rx = ofsx + j * a12;
         const float ry = ofsy + j * a22;
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out++ =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out++ = 0;
         }
      }
   }

So, from my understanding, there are several differences with the linked article:

Here we have a nested for: I've only ever seen single-level for loops in vectorization examples, never a nested loop
The if condition is based on scalar values (x and y) and not on the array: how can I adapt the linked example to this?
The out index isn't based on i or j (so it's not out[i] or out[j]): how can I fill out in this way?

In particular, I'm confused because the for indexes are usually used as array indexes, while here they are used to compute variables and the output pointer is simply incremented on every iteration

I'm using icpc with -O3 -xCORE-AVX2 -qopt-report=5 and a bunch of other optimization flags. According to Intel Advisor, this loop is not vectorized, and using #pragma omp simd generates warning #15552: loop was not vectorized with "simd"

Which compiler do you use? And have you confirmed that your compiler hasn't already auto-vectorized this for you? — Jonas, Mar 31 '17 at 12:51
@Jonas thanks for your comment. Please, look at my updated question — justHelloWorld, Mar 31 '17 at 12:55
Very similar to your previous question http://stackoverflow.com/questions/43136182/compiler-doesnt-vectorize-even-with-simd-directive what's changed? — Richard Critten, Mar 31 '17 at 12:56
@RichardCritten oh gosh, I totally forgot about that one (and from that you should understand how this is driving me crazy). I just deleted it — justHelloWorld, Mar 31 '17 at 12:58
@RichardCritten I tried to write more about what confuses me the most — justHelloWorld, Mar 31 '17 at 13:00
You can start by changing the nested loop into a single loop (deducing i and j from a single counter is easy). You also should not assign to *out++; assign to out[get_index(counter)] instead, because if the loop is vectorized, the order of the values in the out array can get messed up. — Andrew Kashpur, Mar 31 '17 at 13:03
Bilinear interpolation is a rather tricky operation to vectorize, and I wouldn't try it for your first SSE trick. The problem is that the values you need to fetch are not nicely ordered. They're sometimes repeated, sometimes skipped. Any chance you can just use OpenCV or another optimized implementation? — Peter, Mar 31 '17 at 13:06
@AndrewKashpur Thanks for your comment. I agree about reducing to a single loop, and about using `out[something]`, but the main problem here is that we have an `if` statement based not on the array but on variables — justHelloWorld, Mar 31 '17 at 13:06
@Peter yes, I can! But the problem is that I can't figure out what could be the equivalent opencv function — justHelloWorld, Mar 31 '17 at 13:07

score 4 · Accepted Answer · answered Mar 31 '17 at 13:33

Bilinear interpolation is a rather tricky operation to vectorize, and I wouldn't try it for your first SSE trick. The problem is that the values you need to fetch are not nicely ordered. They're sometimes repeated, sometimes skipped. The good news is that interpolating images is a common operation, and you can likely find a pre-written library to do it, such as OpenCV.

remap() is always a good choice. Just build two arrays of wx and wy which represent the fractional source locations of each pixel, and let remap() do the interpolation.
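
As a rough sketch of the "build two arrays" step, the code below evaluates the same wx/wy expressions as the loop in the question and stores them row-major, without doing any interpolation. The `SourceMaps` struct and the `buildSourceMaps` helper are hypothetical names introduced here for illustration; they are not part of OpenCV or of the original code.

```cpp
#include <cstddef>
#include <vector>

// Source coordinates for every output pixel, stored row-major.
// (Hypothetical container, introduced only for this sketch.)
struct SourceMaps {
    int cols;              // 2 * halfWidth + 1
    int rows;              // 2 * halfHeight + 1
    std::vector<float> x;  // wx for each output pixel
    std::vector<float> y;  // wy for each output pixel
};

// Hypothetical helper: records the fractional source location of each output
// pixel using the same formulas as the loop in the question.
SourceMaps buildSourceMaps(int halfWidth, int halfHeight,
                           float ofsx, float ofsy,
                           float a11, float a12, float a21, float a22)
{
    SourceMaps m;
    m.cols = 2 * halfWidth + 1;
    m.rows = 2 * halfHeight + 1;
    m.x.resize(static_cast<std::size_t>(m.cols) * m.rows);
    m.y.resize(m.x.size());

    std::size_t k = 0;  // linear index, same order as *out++ in the question
    for (int j = -halfHeight; j <= halfHeight; ++j) {
        // These depend only on j, so they are computed once per row.
        const float rx = ofsx + j * a12;
        const float ry = ofsy + j * a22;
        for (int i = -halfWidth; i <= halfWidth; ++i, ++k) {
            m.x[k] = rx + i * a11;
            m.y[k] = ry + i * a21;
        }
    }
    return m;
}
```

Assuming OpenCV is available, the two vectors could then be wrapped as `CV_32FC1` `cv::Mat` headers of size rows x cols and passed to `cv::remap` with `cv::INTER_LINEAR` and `cv::BORDER_CONSTANT`. Pixels that straddle the image border may come out slightly different from the original loop, which writes 0 for them instead of blending with the border value.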

However, in this case, it looks like an affine transform. That is, the fractional source pixel is related to the output pixel by a 2x3 matrix multiplication. Those are the offset (ofsx/ofsy) and the a11/a12/a21/a22 variables. OpenCV has such a transform. Read about it here: http://docs.opencv.org/3.1.0/d4/d61/tutorial_warp_affine.html

All you'll have to do is map your input variables into matrix form and call the affine transform.

Peter

Thanks so much for your answer. I think I more or less understood the process you explained, but I'm quite confused about how these functions have to be called. I know this is asking a lot, but could you please write out the solution in more detail? – justHelloWorld Mar 31 '17 at 14:43
could you please help me on this? – justHelloWorld Apr 11 '17 at 18:18
The tutorial I supplied covers this in great detail. I recommend going through the tutorial. If after that you still have questions, post a new question using your new vocabulary and skills. – Peter Apr 11 '17 at 20:10
Thanks for your comment. Seriously, I understood what warp-affine transformations are about (thanks to your tutorial), but I still have doubts about translating the code above into `cv::warpAffine`. For this reason, I opened [this](http://stackoverflow.com/questions/43364596/how-can-i-rewrite-this-warp-affine-using-opencv) question. Could you please take a look at it? – justHelloWorld Apr 12 '17 at 08:46
