Rotating an 8x8 block is done most efficiently with a SIMD or SWAR technique that can read at least 64 bits at a time.
Rot90Left(X) = flip_vertical(transpose(X))
Rot90Right(X) = transpose(flip_vertical(X))
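For example, for a 2x2 block

    X = A B      transpose(X) = A C      flip_vertical(transpose(X)) = B D
        C D                     B D                                    A C

which is X rotated 90 degrees to the left.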
Vertical flip is a zero-cost operation, since it only means storing to / reading from the opposite end of the temporary variables.
If an SSE / SIMD implementation of the transpose can't be used, the SWAR kernel below has proven quite fast on x64 and ARM64 (ARMv8).
#include <stdint.h>
#include <string.h>

/* One delta-swap step: exchanges the odd bytes of *a with the even
   bytes of *b, transposing the 2x2 byte blocks shared by the two rows. */
static inline void transpose_u8(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 8)) & 0x00ff00ff00ff00ffull;
    *a = A ^ (C << 8);
    *b = B ^ C;
}
/* Same trick on 16-bit units: transposes the 2x2 blocks of byte pairs. */
static inline void transpose_u16(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 16)) & 0x0000ffff0000ffffull;
    *a = A ^ (C << 16);
    *b = B ^ C;
}
/* Same trick on 32-bit units: swaps the off-diagonal 4x4 byte blocks. */
static inline void transpose_u32(uint64_t *a, uint64_t *b) {
    uint64_t A = *a, B = *b, C = (B ^ (A >> 32)) & 0x00000000ffffffffull;
    *a = A ^ (C << 32);
    *b = B ^ C;
}
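As a quick sanity check (my own, assuming a little-endian target where byte 0 of a row sits in the least significant byte of the word):

#include <assert.h>

/* Illustration: with rows packed LSB-first, transpose_u8 swaps A's odd
   bytes with B's even bytes -- the 2x2 blocks shared by the two rows. */
void check_transpose_u8(void) {
    uint64_t a = 0x0706050403020100ull;  /* bytes 00..07, LSB first */
    uint64_t b = 0x1716151413121110ull;  /* bytes 10..17 */
    transpose_u8(&a, &b);
    assert(a == 0x1606140412021000ull);  /* (00,10,02,12,04,14,06,16) */
    assert(b == 0x1707150513031101ull);  /* (01,11,03,13,05,15,07,17) */
}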
/* Transpose an 8x8 byte block, optionally flipping it vertically on
   load (LEFT = 7) and/or on store (RIGHT = 7); pass 0 for no flip. */
void transpose8x8(const uint8_t *src, int skip0, uint8_t *dst, int skip1,
                  int LEFT, int RIGHT) {
    uint64_t d[8];
    for (int x = 0; x < 8; x++)
        memcpy(d + (x ^ LEFT), src + x * skip0, 8);
    transpose_u8(d + 0, d + 1);
    transpose_u8(d + 2, d + 3);
    transpose_u8(d + 4, d + 5);
    transpose_u8(d + 6, d + 7);
    transpose_u16(d + 0, d + 2);
    transpose_u16(d + 1, d + 3);
    transpose_u16(d + 4, d + 6);
    transpose_u16(d + 5, d + 7);
    transpose_u32(d + 0, d + 4);
    transpose_u32(d + 1, d + 5);
    transpose_u32(d + 2, d + 6);
    transpose_u32(d + 3, d + 7);
    for (int x = 0; x < 8; x++)
        memcpy(dst + x * skip1, d + (x ^ RIGHT), 8);
}
Here the right rotation (transpose of the vertical flip, per the identity above) is obtained by calling with LEFT = 7, RIGHT = 0; the left rotation (vertical flip of the transpose) with LEFT = 0, RIGHT = 7; and the plain transpose with LEFT = 0, RIGHT = 0.
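A minimal usage sketch (my own, assuming the parameterized signature above and a little-endian target; the corner checks follow from the rotation identities):

#include <assert.h>
#include <stdio.h>

int main(void) {
    uint8_t src[64], dst[64];
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)i;              /* src[row * 8 + col] */
    transpose8x8(src, 8, dst, 8, 7, 0);   /* LEFT = 7, RIGHT = 0: rotate right */
    assert(dst[0] == src[7 * 8]);         /* bottom-left -> top-left */
    assert(dst[7] == src[0]);             /* top-left -> top-right */
    puts("ok");
    return 0;
}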
My hypothesis is that any decent compiler will replace the internal memory accesses in the transpose_uXX functions with direct operations on variables held in registers, and will turn each memcpy into a single 64-bit load or store -- this should happen at least on 64-bit architectures. On 32-bit x86 there aren't enough general-purpose registers, and the practical alternative is to use whichever SIMD registers and instruction set are available.
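For reference, here is a sketch of what such a SIMD transpose can look like with SSE2 unpack instructions (a hedged example, not the author's implementation; the name transpose8x8_sse2 is mine). The three rounds of unpacks mirror the three delta-swap rounds above.

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

void transpose8x8_sse2(const uint8_t *src, int skip0,
                       uint8_t *dst, int skip1) {
    __m128i r[8], t[4], u[4], v[4];
    for (int x = 0; x < 8; x++)          /* load 8 bytes per row */
        r[x] = _mm_loadl_epi64((const __m128i *)(src + x * skip0));
    /* Round 1: interleave the bytes of adjacent row pairs. */
    t[0] = _mm_unpacklo_epi8(r[0], r[1]);
    t[1] = _mm_unpacklo_epi8(r[2], r[3]);
    t[2] = _mm_unpacklo_epi8(r[4], r[5]);
    t[3] = _mm_unpacklo_epi8(r[6], r[7]);
    /* Round 2: interleave 16-bit pairs. */
    u[0] = _mm_unpacklo_epi16(t[0], t[1]);
    u[1] = _mm_unpackhi_epi16(t[0], t[1]);
    u[2] = _mm_unpacklo_epi16(t[2], t[3]);
    u[3] = _mm_unpackhi_epi16(t[2], t[3]);
    /* Round 3: interleave 32-bit quads; v[k] now holds output
       rows 2k (low half) and 2k+1 (high half). */
    v[0] = _mm_unpacklo_epi32(u[0], u[2]);
    v[1] = _mm_unpackhi_epi32(u[0], u[2]);
    v[2] = _mm_unpacklo_epi32(u[1], u[3]);
    v[3] = _mm_unpackhi_epi32(u[1], u[3]);
    for (int k = 0; k < 4; k++) {
        _mm_storel_epi64((__m128i *)(dst + (2 * k) * skip1), v[k]);
        _mm_storel_epi64((__m128i *)(dst + (2 * k + 1) * skip1),
                         _mm_srli_si128(v[k], 8));
    }
}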