
I'm not sure how else I can optimize this piece of code so that it is efficient. So far I've unrolled the inner for loop by 16 with respect to j, and it produces a mean CPE of 1.4. I need to get the mean CPE to around 2.5 through optimization techniques. I've read the other questions on this topic, but they deal with code that is a bit different from mine. The first block of code below is what I'm given, followed by my attempt at unrolling the loop. The given code scans the rows of the source image matrix and copies each one to the flipped row of the destination image matrix. Any help would be greatly appreciated!

RIDX Macro:

#define RIDX(i,j,n) ((i)*(n)+(j))

Given:

void naive_rotate(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
    {
        for (j = 0; j < dim; j++)
        {
            dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
        }
    }
}

My attempt: This does optimize it, but only a bit; the mean CPE goes up from 1.0 to 1.4. I'd like it to be around 2.5. I've tried various kinds of blocking and other techniques I've read about online, but I haven't managed to optimize it further.

for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j += 16) {
        dst[RIDX(dim-1-i, j, dim)]    = src[RIDX(i, j, dim)];
        dst[RIDX(dim-1-i, j+1, dim)]  = src[RIDX(i, j+1, dim)];
        dst[RIDX(dim-1-i, j+2, dim)]  = src[RIDX(i, j+2, dim)];
        dst[RIDX(dim-1-i, j+3, dim)]  = src[RIDX(i, j+3, dim)];
        dst[RIDX(dim-1-i, j+4, dim)]  = src[RIDX(i, j+4, dim)];
        dst[RIDX(dim-1-i, j+5, dim)]  = src[RIDX(i, j+5, dim)];
        dst[RIDX(dim-1-i, j+6, dim)]  = src[RIDX(i, j+6, dim)];
        dst[RIDX(dim-1-i, j+7, dim)]  = src[RIDX(i, j+7, dim)];
        dst[RIDX(dim-1-i, j+8, dim)]  = src[RIDX(i, j+8, dim)];
        dst[RIDX(dim-1-i, j+9, dim)]  = src[RIDX(i, j+9, dim)];
        dst[RIDX(dim-1-i, j+10, dim)] = src[RIDX(i, j+10, dim)];
        dst[RIDX(dim-1-i, j+11, dim)] = src[RIDX(i, j+11, dim)];
        dst[RIDX(dim-1-i, j+12, dim)] = src[RIDX(i, j+12, dim)];
        dst[RIDX(dim-1-i, j+13, dim)] = src[RIDX(i, j+13, dim)];
        dst[RIDX(dim-1-i, j+14, dim)] = src[RIDX(i, j+14, dim)];
        dst[RIDX(dim-1-i, j+15, dim)] = src[RIDX(i, j+15, dim)];
    }
}
    Can we assume you have turned the optimization level up in your compiler to see what it can do for you? If not, I recommend starting there. – user4581301 Jul 21 '17 at 00:04
  • No, I haven't done that. I'm not fully aware of how that works. I'm supposed to optimize it through common techniques like blocking, unrolling, and code motion. – user45678 Jul 21 '17 at 00:05
  • What is in the RIDX macro? That's important to know. – Michaël Roy Jul 21 '17 at 00:06
  • The loop unrolling works only if dim is a multiple of 16... :( – Michaël Roy Jul 21 '17 at 00:12
  • Ah. Got you. You may be able to game some benefit from improving or caching `RIDX(i, j, dim)`. What the compiler can provide may give you a good baseline of what sort of improvements you can expect, so it might be worth switching on and re-profiling. Off topic, What is CPE? I've googled the term and come up with nothing that seems to fit. Computation P-something Efficiency? – user4581301 Jul 21 '17 at 00:15
  • @MichaëlRoy "The loop unrolling works only if dim is a multlple of 16..." not necessarily. [See Duff's Device](https://stackoverflow.com/questions/514118/how-does-duffs-device-work) – user4581301 Jul 21 '17 at 00:16
  • @MichaëlRoy, I believe so. I tried 2 X 2 unrolling and it gave me a way lower CPE than what I'm getting from this. I'm not sure if unrolling both loops will be helpful or not? – user45678 Jul 21 '17 at 00:17
  • @user4581301, Cycles per Element. – user45678 Jul 21 '17 at 00:18
  • Unrolling is only useful if the expense of the loop is noticeable against what's in the loop. In the typical example of `a[i] = b[i];` the loop overhead is significant. In this case we have no idea how expensive `RIDX(dim-1-i,j, dim)` is. We also have no idea how this may aid or abuse the CPU's cache. Ah! Sorry I see an edit up there with `RIDX`. Apologies. May be a bit of use in unrolling, but not much. – user4581301 Jul 21 '17 at 00:20
  • I'm a bit confused, because the operation doesn't look at all like a rotate. It's a row inversion. Are you certain about the algorithm? – Michaël Roy Jul 21 '17 at 00:21
  • By trying to arrange for pointer operations only, I quickly arrive at a fairly good result using memcpy - a huge speed boost. Is pixel_t safe to copy/move using memcpy? Can you show it? – Michaël Roy Jul 21 '17 at 00:27
  • @user45678 thanks for the info. How are you measuring CPE? – user4581301 Jul 21 '17 at 00:27
  • @MichaëlRoy yes it's a row inversion and what do you exactly mean about pixel_t ? – user45678 Jul 21 '17 at 00:31
  • @user4581301, I'm not sure, as there is a separate program I run to check the CPE. That is not of interest to me; only optimizing this code is. – user45678 Jul 21 '17 at 00:32
  • pixel_t is the type we're moving around, the contents of the matrices, I got it from your function declaration. – Michaël Roy Jul 21 '17 at 00:34
  • @MichaëlRoy yes I understand but what were you asking about it related to memcpy? – user45678 Jul 21 '17 at 00:35
  • OK. I asked because if it's time based, we can take advantage of cache usage optimization. If it's brute operations performed, there's no point trying to take advantage of what the CPU can do for you. – user4581301 Jul 21 '17 at 00:36
  • @user4581301 it's running for about 9 different dimensions and returning a CPE for each one then computing a mean CPE. I think it could be time based. – user45678 Jul 21 '17 at 00:39
  • @user4581301 can you please show a small example of what you mean by swapping? – user45678 Jul 21 '17 at 00:40
  • The idea was you only have to make half as many visits, because if you move `src[20]` to `dst[0]` and `src[0]` to `dst[20]`, you do both at the same time. Saves having to do the same `RIDX` calls in reverse. (See the sketch after these comments.) – user4581301 Jul 21 '17 at 00:44
  • @user4581301 memcpy is probably the most optimized function in the entire runtime library. It would take quite an effort to beat memcpy at copying memory, if it's possible at all. – Michaël Roy Jul 21 '17 at 00:49
  • @user4581301 still can't see your process. – user45678 Jul 21 '17 at 00:51
  • @MichaëlRoy, I dont think I can use memcpy because I'm supposed to optimize it using motion, unrolling, and blocking. – user45678 Jul 21 '17 at 00:52
  • `memcpy` or no, the dodge @MichaëlRoy uses, calling `RIDX` only once, should help a lot. – user4581301 Jul 21 '17 at 00:55
  • @user4581301 Then my first loop is a good place to start. And memcpy should give you a baseline timing, an idea of how fast you'll be able to get this thing going. – Michaël Roy Jul 21 '17 at 00:59
  • @MichaëlRoy, I will check this out in an hour or so and update. – user45678 Jul 21 '17 at 01:05
  • for anyone: would tiling not help in this case? Doesn't tiling try to optimize the code with efficient usage of the cache and memory? – user45678 Jul 21 '17 at 03:28
  • @MichaëlRoy, I cannot use memcpy because I'm told not to use functions like that. Any other ideas? – user45678 Jul 21 '17 at 14:44
  • @user45678 using memcpy _is_ tiling. – Michaël Roy Jul 21 '17 at 15:08
  • @MichaëlRoy sorry I guess I had a misunderstanding between tiling and blocking. Then I'm not allowed to use tiling. – user45678 Jul 21 '17 at 15:12
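
To make the two-rows-at-a-time idea from the comments concrete, here is a minimal sketch (not from the original discussion; the function name is made up). Each outer pass copies source row i into destination row dim-1-i and source row dim-1-i into destination row i, so the row offsets are computed half as many times. It assumes dim is even, which holds for the usual benchmark sizes.

void flip_two_rows(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    int i, j;
    for (i = 0; i < dim / 2; i++)
    {
        /* Pointers to the two rows handled in this pass. */
        struct pixel_t *s_top = src + RIDX(i, 0, dim);
        struct pixel_t *s_bot = src + RIDX(dim - 1 - i, 0, dim);
        struct pixel_t *d_top = dst + RIDX(i, 0, dim);
        struct pixel_t *d_bot = dst + RIDX(dim - 1 - i, 0, dim);
        for (j = 0; j < dim; j++)
        {
            d_bot[j] = s_top[j];  /* src row i         -> dst row dim-1-i */
            d_top[j] = s_bot[j];  /* src row dim-1-i   -> dst row i       */
        }
    }
}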

1 Answer


Here's a quick old-school memcpy optimization. This is usually very efficient and should not need unrolling.

From RIDX:

#define RIDX(i,j,n) ((i)*(n)+(j))

We know that incrementing the 'j' component translates to a simple pointer increment.

struct pixel_t* s = &src[RIDX(0, 0, dim)];
struct pixel_t* d = &dst[RIDX(dim - 1, 0, dim)];

for (int i = 0; i < dim; ++i, d -= (2 * dim))
{
   for (int j = 0; j < dim; ++j, ++s, ++d)
   {
       //dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
       *d = *s;  

       // you could do it the hard way and start loop unrolling from here
   }
}

In the inner loop in the code above, ++s, ++d give a hint that a memcpy optimization is possible. Note that a memcpy optimization is only possible if the type we're copying can be moved safely. Most types are, but it's something that has to be taken into account. Using memcpy does bend the strict rules of C++ a bit, but memcpy is fast.

The loops then become:

struct pixel_t* s = &src[RIDX(0, 0, dim)];
struct pixel_t* d = &dst[RIDX(dim - 1, 0, dim)];

for (int i = 0; i < dim; ++i, d -= dim, s += dim)
{
   memcpy(d, s, dim * sizeof(pixel_t));
   // or...
   std::copy(s, s + dim, d);  // which is 'safer' but could be slower...
}

In most modern STL implementations, std::copy will translate to a memcpy in most cases. memcpy uses all the tricks in the book to make the copy faster: loop unrolling, cache look-ahead, etc.
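
As a side note (not part of the original answer): if this is built as C++, the "safe to copy with memcpy" condition mentioned above can be checked at compile time with a type trait. The pixel layout below is only a guess, since the real struct pixel_t isn't shown in the question.

#include <cstring>
#include <type_traits>

// Hypothetical layout; the actual struct pixel_t is not shown in the question.
struct pixel_t { unsigned short red, green, blue; };

// If this assertion fails, the memcpy/std::copy shortcut is not safe for pixel_t.
static_assert(std::is_trivially_copyable<pixel_t>::value,
              "pixel_t must be trivially copyable to use memcpy");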

Michaël Roy
  • A few things: what is the first part, basically? Are you just showing me what incrementing the j component does, or do I need to insert that code and use it? In the last part, the declarations for pixel_t* s and pixel_t* d are throwing an "incompatible types" error. Are you sure this should work? – user45678 Jul 21 '17 at 03:03
  • I can't use memcpy as it is not allowed. Not allowed to use functions like that. So this isn't a valid solution, but thanks Michael for teaching me about memcpy. – user45678 Jul 21 '17 at 12:43
  • Sorry, I live in the real world. And that's how it's done. I know it's impossible to explain to your teacher, but not letting you use language features doesn't teach you anything. And excuse me for not doing your assignment. – Michaël Roy Jul 21 '17 at 14:59
  • So I cannot receive any guidance on this other than using memcpy? – user45678 Jul 21 '17 at 15:21
  • You can still use the first loop as a good basis for blocking and loop unrolling, using only simple pointer addition. Unrolling by 4 will give you good results in this case (see the sketch below). Or you can implement your own memcpy-like function. – Michaël Roy Jul 21 '17 at 15:27
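
For reference, here is a minimal sketch of what that unroll-by-4 suggestion might look like, built on the pointer form of the loop from the answer (the function name is made up, and it assumes dim is a multiple of 4, which is true for the usual benchmark sizes).

void flip_unroll4(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    struct pixel_t *s = src;                          /* row 0 of src    */
    struct pixel_t *d = dst + RIDX(dim - 1, 0, dim);  /* last row of dst */
    int i, j;

    for (i = 0; i < dim; i++, s += dim, d -= dim)
    {
        for (j = 0; j < dim; j += 4)
        {
            /* Copy four pixels per iteration; only pointer arithmetic, no RIDX inside the loop. */
            d[j]     = s[j];
            d[j + 1] = s[j + 1];
            d[j + 2] = s[j + 2];
            d[j + 3] = s[j + 3];
        }
    }
}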