2

While reading a post on StackOverflow (http://stackoverflow.com/questions/1502081/im-trying-to-optimize-this-c-code-using-4-way-loop-unrolling), which is now marked as closed, I came across an answer (comment, in fact) that said the following: "The two inner loops could possibly get a speed boost by using UInt64 and bit shifting"

Here is the code that was int he post:

char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";

    void rotate8(int dim, pixel *src, pixel *dst) 
    {

    int i, j, ii, jj;

    for(ii = 0; ii < dim; ii += 8)
           for(jj = 0; jj < dim; jj += 8)
                  for (i = ii; i < ii + 8; i++)   
                      for (j = jj; j < jj + 8; j++)
                          dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }

Could anyone please explain how would that be applied here? I am interested in knowing how to apply bitshifting on this code, or a similar code, and why that would help in performance. Also, how would this code be optimized for cache usage? Any suggestions?

Assume this code was Double Tiled/Blocked (big tile=32, and inside it tiles of 16), and also Loop Invariant Code Motion was applied.. would it still benefit from bitshifting and UInt64?

If not, then what other suggestions would work?

Thanks!

JoHa
  • 1,989
  • 10
  • 28
  • 42

1 Answers1

1

If the pixels were smaller, you could use 8 Uint64 registers (they are big and there are plenty of them) to cumulate there the result for rotated matrix.

Example for sizeof(pixel) == 1 and little endian machine:

for (int y = 0; y < 8; ++y){
 // for every line, we get 8 pixels from row y into src0.
 // they should go in the last colomn of the result
 // so after 8 iterations they'll get exactly in the 8ht byte 
  Uint64 src0 = *(Uint64*)(src + dim * y);
  dst0 = (dst0 << 8) | ( src0 & 0xff); // this was pixel src[y][0]
  dst1 = (dst1 << 8) | ((src0 >> 8) & 0xff); // and pixel src[y][1]
  etc...
};
// now the 8 dst0..dst7 registers contain rows 0..7 of the result. 
// putting them there
*(Uint64*)(dst) = dst0;
*(Uint64*)(dst + dim) = dst1;
etc..

The good part is that it's easier to unroll and reorder, and there are fewer memory accesses.

ruslik
  • 14,714
  • 1
  • 39
  • 40
  • so you mean with the current size of the "pixel" I can't use this? – JoHa Oct 12 '10 at 03:38
  • Sure you can, but the benefit could be greater on small pixels. Anyway, if you'll help the compiler to make the memory access in 64bit chunks only on aligned addresses, it will be great. It would be quite inefficient to let it work on unaligned 6 byte structures. – ruslik Oct 12 '10 at 03:44
  • Well I kind of get the concept. But could you please elaborate more on how this could be applied in my case? I am kind of lost after the second line.. could you continue the code till the end so that I test and understand the full picture? – JoHa Oct 12 '10 at 03:50
  • I understand that this reduces memory reading by flushing once, but I want to make sure I understand the concept correctly to absorb it.. Could you apply it on the most inner loop as it should be with the size of "pixel" that I have? – JoHa Oct 12 '10 at 03:51
  • When you say: dst0 = (dst0 << 8) | ( src0 & 0xff); why are you shifting a newly created Uint64 variable 8 bits to the left, while it is already empty? – JoHa Oct 12 '10 at 04:09
  • We need the shift in the next 7 iterations, and by making it in the first one we can skip initialization of the dst registers. – ruslik Oct 12 '10 at 04:13
  • I see... ok could you modify the post to have all the 8 lines (instead of the etc..)? I just want to make sure I understand it correctly – JoHa Oct 12 '10 at 04:19
  • Understood.. thanks! I will try to implement and let you know if I have any further questions – JoHa Oct 12 '10 at 05:48
  • Ok, so you said this could be used for the size of PIXEL of say 6 or 8 bytes. Could you explain how? Because the largest variable possible as I see is uint64, and this fits only 1 PIXEL structure, opposed to 8 pixels I have per row (if we assume size of TILE to be 8). How would I apply this bit-shifting trick on my case then? There is no way I can store more than one pixel in a uint64 variable, and thus I will end up doing 8 memory reads to get the 8 pixels.. plz explain thnx – JoHa Oct 12 '10 at 07:19