1

I have this C code:

for (k = 0; k < n_n; k++) {
    if (k == i || k == j) continue;
    dd=q2_vect[k]-q1_vect;
    d2=dd*dd;
    if (d2<0) {
        a=1;
        break;
    }       
}  

For compiler optimization reasons (on the SPE of the Cell processor), I need to unloop this by hand, so I tried:

dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

.....
.....

// end
goto notdone;

done: 
ok=0;

notdone:
.....

but I do not know how to deal with the

if (k == i || k == j) continue;

and with the fact that the lopp depends on each run on "n_n", and by hand I should write the code so many times as the maximal value "n_n" would get.

How do you think it can be fixed?

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
Open the way
  • 26,225
  • 51
  • 142
  • 196

6 Answers6

3

Are you sure the code as written is correct? The current code has undefined behavior if dd is a signed integer type, and the condition in the if is never satisfied if d2 is unsigned or if dd and d2 are floating point types. It looks like you're doing a broken search for the first index k other than i or j where squaring the expression q2_vect[ k]-q1_vect overflows.

As for efficiently skipping the i and j iterations, I would instead just look at where the unrolled "loop" stopped, and restart it at k+1 if k was equal to i or j. This is assuming the code in your loop has no side effects/running total, which is true as written, but I expect you might have meant for the code to do something else (like summing the squares).

Finally, I am highly skeptical of your wish to unroll the loop manually when you don't even seem to have working code to begin with. Any good compiler can unroll the loop for you, but often the type of loop unrolling you're looking to do makes performance worse rather than better. I think you'd do better getting your code to work correctly first, then measuring (and looking at the compiler-generated asm), and only trying to improve on that after you've determined there's a problem.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • Are you sure the compiler will unroll this loop automatically? Unless n_n is a constant, I think a compiler will have a hard time determining how to unroll it. – onemasse Nov 28 '10 at 18:38
  • 2
    Condition may be satisfied (wuthout undefined behaviour) if dd is `_Imaginary` and d2 is of real type. :) – Vovanium Nov 28 '10 at 19:11
  • Fully unrolling a loop is almost never useful. Instead the compiler might unroll it to perform 8 or 16 iterations per run, and handle the tail with a construct similar to duff's device. – R.. GitHub STOP HELPING ICE Nov 28 '10 at 19:12
  • @Vovanium: haha fair enough. Wish I could give you more than +1 for noticing that. Is OP really using complex types though?? – R.. GitHub STOP HELPING ICE Nov 28 '10 at 19:13
  • 1
    @R..: This of course very unlikely that author rely on imaginary numbers, but many's belief that x*x>=0 is not true. I i'm angry. :) – Vovanium Nov 28 '10 at 19:40
  • sorry for the "if" mistake (I wrote it fast) and the (d2<0), it is (d2>0) – Open the way Nov 28 '10 at 23:03
1

This code as written is fairly unsuitable for SPEs since it's so branch-heavy. Also, information on the types of the variables involved would help; the test as written seems fairly obscure (even with the >0 fix), but the code looks like it might be C++ using some sort of vector class that overloads operator - to mean vector subtraction and operator * of two vectors to compute a dot product.

The first thing to do with such simple loops on SPEs is to get them branch-free (at least the inner loop; i.e. unroll a couple of times and only check for early exit every N iterations) and use SIMD instructions: SPEs only have SIMD instructions, so not using SIMD processing in your loops instantly wastes 75% of your available register space and computational power. Similarly, SPEs can only load aligned qwords (16 bytes) at a time, using smaller data types requires extra work to shuffle the contents of registers around so that the value you're trying to load ends up in the "preferred slot".

You get rid of the if (k == i || k == j) by rewriting the first part of the loop using the following branch-free form (this is pseudocode. It's immediately applicable for ints, but you'll need to use intrinsics to get bitwise ops on floats):

dd = q2_vect[k] - q1_vect;
d2 = dd * dd;
d2 &= ~(cmp_equal(k, i) | cmp_equal(k, j));

Here, cmp_equal corresponds to the respective SPE intrinsics (semantics: cmp_equal(a,b) == (a == b) ? ~0u : 0). This forces d2 to zero when k == i or k == j.

To avoid the if (d2 > 0) branch in the inner loop, do the following:

a |= cmp_greater(d2, 0);

and only check if a is nonzero (to early-out) every few loop iterations. If all values computed for d2 are nonnegative (will be the case if your type is ints, floats or a real-valued vector class), you can simplify this further. Just do:

a |= d2;

In the end, a will only be nonzero if all of the individual terms were nonzero. But be careful with integer overflows (if you're using ints) and NaNs (if you're using floats). If you have to handle these cases, the above simplification will break the code.

Fabian Giesen
  • 3,231
  • 16
  • 16
  • +1 for SIMD. It's worth noting that if you vectorize the loop, `k` will have different values in each element when you compare it against `i` and `j`, e.g. it will start at {0,1,2,3} and increment by {4,4,4,4} each time through. Also, the `spu_orx` ("OR across") intrinsic might be useful here for checking if any of the comparison results are set. – celion Nov 29 '10 at 08:22
0

For the first problem, you need to not "execute" the loop body when the condition is met. For this particular problem, you can just place the logical negation of that condition inside the if statement's condition.

Normally unrolling is by a factor; the unrolled code still lives in a loop (unless the loop bounds are known to be very small). Furthermore, you'll need to do the "remainder" of the work (corresponding to the remainder of the problem size divided by the unroll factor) outside the loop.

So, an example of loop unrolling:

for (i = 0; i < n; ++i) do_something(i);

can be unrolled by a factor of 2 to:

for (i = 0; i < n-1; i += 2) { do_something(i); do_something(i+1); }
for (; i < n; ++i) do_something(i);

where the second loop does the "remainder" (it also sets i to be the same thing as the unrolled loop would have, but if i is not needed after this, the whole line can just be if (i < n) etc for this case).

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
lijie
  • 4,811
  • 22
  • 26
0

assuming n_n is a compile-time constant, the loop may be trivially unrolled like so:

do
{ 
  k=0
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  k=1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  /* and so on, n_n times */

  k= n_n-1
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

} while (0);

essentially, everything after the continue goes into the else portion of an if statement

Edit: since n_n is not a compile time constant, you can still unroll the loop by doing several runs through the loop in a loop, and then finish with a switch-case statement. in fact, you can combine them like so, this is called the Duff's device.

#define LOOP_BODY              \
do{                            \  
  if (k == i || k == j)        \
    ;                          \
  else                         \
  {                            \
    dd=q2_vect[ k]-q1_vect;    \
    d2=dd*dd;                  \
    if (d2<0)                  \
    {                          \
      a=1;                     \
      break;                   \
    }                          \
  } while (0)          


k = 0;
switch(n_n % 8)
{
  case 0: for (; k < n_n; k++) { LOOP_BODY; k++; 
  case 7:                        LOOP_BODY; k++;
  case 6:                        LOOP_BODY; k++;
  case 5:                        LOOP_BODY; k++;
  case 4:                        LOOP_BODY; k++;
  case 3:                        LOOP_BODY; k++;
  case 2:                        LOOP_BODY; k++;    
  case 1:                        LOOP_BODY; k++;}
}
SingleNegationElimination
  • 151,563
  • 33
  • 264
  • 304
  • the problem is: n_n is a runtime value, given input by user. How can I at the same time unroll the loop by-hand n_n times if I do not know the value? – Open the way Nov 28 '10 at 22:57
0

Usually loop unrolling means making the loop contain a few iterations, such that it is run fewer times. For example,

for(i=0;i<count;i++) {
    printf("%d", i);
}

could be unrolled to

i=0;
if(count%2==1) {
    printf("%d", i);
    i=1;
}
while(i<count) {
    printf("%d", i);
    printf("%d", i+1);
    i+=2;
}
zebediah49
  • 7,467
  • 1
  • 33
  • 50
0

Unrolling this loop will not help here much. Inner loop software unrolling helps with software pipelining of instructions to achieve higher IPC at run-time. Here it could corrupt the logic by unrolling.

Pari Rajaram
  • 422
  • 3
  • 7