Efficient by-hand loop unrolling

Question

I have this C code:

for (k = 0; k < n_n; k++) {
    if (k == i || k == j) continue;
    dd=q2_vect[k]-q1_vect;
    d2=dd*dd;
    if (d2<0) {
        a=1;
        break;
    }       
}

For compiler optimization reasons (on the SPE of the Cell processor), I need to unloop this by hand, so I tried:

dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

.....
.....

// end
goto notdone;

done: 
ok=0;

notdone:
.....

but I do not know how to deal with the

if (k == i || k == j) continue;

and with the fact that the lopp depends on each run on "n_n", and by hand I should write the code so many times as the maximal value "n_n" would get.

How do you think it can be fixed?

And even if you add the necessary parentheses, a good compiler will optimize out the whole if statement because the C language defines no conditions under which it could be true. (The code seems to be trying to rely on undefined behavior.) — R.. GitHub STOP HELPING ICE, Nov 28 '10 at 18:24
not right: `dd == _Imaginary_I` -> `dd*dd == -1.0` and that's valid C. — Vovanium, Nov 28 '10 at 19:22
So you're looking for the first index other than `i` or `j` where `d2` is not zero? Why are you doing multiplication to get `d2` in the first place? The condition then is equivalent to `q2_vect[k]!=q1_vect`... — R.. GitHub STOP HELPING ICE, Nov 28 '10 at 23:02

score 3 · Accepted Answer · answered Nov 28 '10 at 18:20

3

Are you sure the code as written is correct? The current code has undefined behavior if dd is a signed integer type, and the condition in the if is never satisfied if d2 is unsigned or if dd and d2 are floating point types. It looks like you're doing a broken search for the first index k other than i or j where squaring the expression q2_vect[ k]-q1_vect overflows.

As for efficiently skipping the i and j iterations, I would instead just look at where the unrolled "loop" stopped, and restart it at k+1 if k was equal to i or j. This is assuming the code in your loop has no side effects/running total, which is true as written, but I expect you might have meant for the code to do something else (like summing the squares).

Finally, I am highly skeptical of your wish to unroll the loop manually when you don't even seem to have working code to begin with. Any good compiler can unroll the loop for you, but often the type of loop unrolling you're looking to do makes performance worse rather than better. I think you'd do better getting your code to work correctly first, then measuring (and looking at the compiler-generated asm), and only trying to improve on that after you've determined there's a problem.

answered Nov 28 '10 at 18:20

R.. GitHub STOP HELPING ICE

208,859
35
376
711

Are you sure the compiler will unroll this loop automatically? Unless n_n is a constant, I think a compiler will have a hard time determining how to unroll it. – onemasse Nov 28 '10 at 18:38
2

Condition may be satisfied (wuthout undefined behaviour) if dd is `_Imaginary` and d2 is of real type. :) – Vovanium Nov 28 '10 at 19:11
Fully unrolling a loop is almost never useful. Instead the compiler might unroll it to perform 8 or 16 iterations per run, and handle the tail with a construct similar to duff's device. – R.. GitHub STOP HELPING ICE Nov 28 '10 at 19:12
@Vovanium: haha fair enough. Wish I could give you more than +1 for noticing that. Is OP really using complex types though?? – R.. GitHub STOP HELPING ICE Nov 28 '10 at 19:13
1

@R..: This of course very unlikely that author rely on imaginary numbers, but many's belief that x*x>=0 is not true. I i'm angry. :) – Vovanium Nov 28 '10 at 19:40
sorry for the "if" mistake (I wrote it fast) and the (d2<0), it is (d2>0) – Open the way Nov 28 '10 at 23:03

score 1 · Answer 2 · answered Nov 29 '10 at 00:54

This code as written is fairly unsuitable for SPEs since it's so branch-heavy. Also, information on the types of the variables involved would help; the test as written seems fairly obscure (even with the >0 fix), but the code looks like it might be C++ using some sort of vector class that overloads operator - to mean vector subtraction and operator * of two vectors to compute a dot product.

The first thing to do with such simple loops on SPEs is to get them branch-free (at least the inner loop; i.e. unroll a couple of times and only check for early exit every N iterations) and use SIMD instructions: SPEs only have SIMD instructions, so not using SIMD processing in your loops instantly wastes 75% of your available register space and computational power. Similarly, SPEs can only load aligned qwords (16 bytes) at a time, using smaller data types requires extra work to shuffle the contents of registers around so that the value you're trying to load ends up in the "preferred slot".

You get rid of the if (k == i || k == j) by rewriting the first part of the loop using the following branch-free form (this is pseudocode. It's immediately applicable for ints, but you'll need to use intrinsics to get bitwise ops on floats):

dd = q2_vect[k] - q1_vect;
d2 = dd * dd;
d2 &= ~(cmp_equal(k, i) | cmp_equal(k, j));

Here, cmp_equal corresponds to the respective SPE intrinsics (semantics: cmp_equal(a,b) == (a == b) ? ~0u : 0). This forces d2 to zero when k == i or k == j.

To avoid the if (d2 > 0) branch in the inner loop, do the following:

a |= cmp_greater(d2, 0);

and only check if a is nonzero (to early-out) every few loop iterations. If all values computed for d2 are nonnegative (will be the case if your type is ints, floats or a real-valued vector class), you can simplify this further. Just do:

a |= d2;

In the end, a will only be nonzero if all of the individual terms were nonzero. But be careful with integer overflows (if you're using ints) and NaNs (if you're using floats). If you have to handle these cases, the above simplification will break the code.

+1 for SIMD. It's worth noting that if you vectorize the loop, `k` will have different values in each element when you compare it against `i` and `j`, e.g. it will start at {0,1,2,3} and increment by {4,4,4,4} each time through. Also, the `spu_orx` ("OR across") intrinsic might be useful here for checking if any of the comparison results are set. — celion, Nov 29 '10 at 08:22

score 0 · Answer 3 · edited Nov 28 '10 at 18:27

For the first problem, you need to not "execute" the loop body when the condition is met. For this particular problem, you can just place the logical negation of that condition inside the if statement's condition.

Normally unrolling is by a factor; the unrolled code still lives in a loop (unless the loop bounds are known to be very small). Furthermore, you'll need to do the "remainder" of the work (corresponding to the remainder of the problem size divided by the unroll factor) outside the loop.

So, an example of loop unrolling:

for (i = 0; i < n; ++i) do_something(i);

can be unrolled by a factor of 2 to:

for (i = 0; i < n-1; i += 2) { do_something(i); do_something(i+1); }
for (; i < n; ++i) do_something(i);

where the second loop does the "remainder" (it also sets i to be the same thing as the unrolled loop would have, but if i is not needed after this, the whole line can just be if (i < n) etc for this case).

SingleNegationElimination · Answer 4 · 2010-11-28T23:16:36.817

assuming n_n is a compile-time constant, the loop may be trivially unrolled like so:

do
{ 
  k=0
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  k=1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  /* and so on, n_n times */

  k= n_n-1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

} while (0);

essentially, everything after the continue goes into the else portion of an if statement

Edit: since n_n is not a compile time constant, you can still unroll the loop by doing several runs through the loop in a loop, and then finish with a switch-case statement. in fact, you can combine them like so, this is called the Duff's device.

#define LOOP_BODY              \
do{                            \  
  if (k == i || k == j)        \
    ;                          \
  else                         \
  {                            \
    dd=q2_vect[ k]-q1_vect;    \
    d2=dd*dd;                  \
    if (d2<0)                  \
    {                          \
      a=1;                     \
      break;                   \
    }                          \
  } while (0)          


k = 0;
switch(n_n % 8)
{
  case 0: for (; k < n_n; k++) { LOOP_BODY; k++; 
  case 7:                        LOOP_BODY; k++;
  case 6:                        LOOP_BODY; k++;
  case 5:                        LOOP_BODY; k++;
  case 4:                        LOOP_BODY; k++;
  case 3:                        LOOP_BODY; k++;
  case 2:                        LOOP_BODY; k++;    
  case 1:                        LOOP_BODY; k++;}
}

the problem is: n_n is a runtime value, given input by user. How can I at the same time unroll the loop by-hand n_n times if I do not know the value? — Open the way, Nov 28 '10 at 22:57

score 0 · Answer 5 · answered Nov 28 '10 at 18:22

0

Usually loop unrolling means making the loop contain a few iterations, such that it is run fewer times. For example,

for(i=0;i<count;i++) {
    printf("%d", i);
}

could be unrolled to

i=0;
if(count%2==1) {
    printf("%d", i);
    i=1;
}
while(i<count) {
    printf("%d", i);
    printf("%d", i+1);
    i+=2;
}

answered Nov 28 '10 at 18:22

zebediah49

7,467
1
33
50

what I do not understand or see here, is how can the compiler optimize it – Open the way Nov 28 '10 at 22:55

score 0 · Answer 6 · answered Apr 19 '11 at 23:25

0

Unrolling this loop will not help here much. Inner loop software unrolling helps with software pipelining of instructions to achieve higher IPC at run-time. Here it could corrupt the logic by unrolling.

answered Apr 19 '11 at 23:25

Pari Rajaram

422
3
7

Efficient by-hand loop unrolling

6 Answers6

Linked