I'm not sure how else to optimize this piece of code. So far I've unrolled the inner for loop by 16 with respect to j, which produces a mean CPE of 1.4. I need to get a mean CPE of around 2.5 through optimization techniques. I've read the other questions on this topic, but they deal with code that is a bit different from mine. The first block of code is what I was given, followed by my attempt at unrolling the loop. The given code scans the rows of the source image matrix and copies each one to the flipped row of the destination image matrix. Any help would be greatly appreciated!
RIDX Macro:
#define RIDX(i,j,n) ((i)*(n)+(j))
Given:
void naive_rotate(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    int i, j;
    for(i = 0; i < dim; i++)
    {
        for(j = 0; j < dim; j++)
        {
            dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
        }
    }
}
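Since RIDX is row-major, each source row is one contiguous run of dim pixels, so the given loop can also be read as dim whole-row copies into the mirrored row of dst. Below is a minimal sketch of that reading, not something I've measured. The layout of struct pixel_t is an assumption (its real definition isn't shown here), and rowcopy_flip and flips_match are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RIDX(i,j,n) ((i)*(n)+(j))

/* Assumed layout: the real pixel_t definition is not shown in the question. */
struct pixel_t {
    unsigned short red, green, blue;
};

/* The given version, unchanged in behavior. */
void naive_rotate(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
}

/* Hypothetical helper: the same flip written as one memcpy per row.
   Row i of src lands in row dim-1-i of dst. */
void rowcopy_flip(int dim, struct pixel_t *src, struct pixel_t *dst)
{
    size_t row_bytes = (size_t)dim * sizeof(struct pixel_t);
    int i;
    for (i = 0; i < dim; i++)
        memcpy(&dst[RIDX(dim-1-i, 0, dim)], &src[RIDX(i, 0, dim)], row_bytes);
}

/* Hypothetical check: returns 1 if both versions produce the same
   output for a dim x dim test image, 0 otherwise. */
int flips_match(int dim)
{
    assert(dim > 0);
    size_t n = (size_t)dim * (size_t)dim;
    struct pixel_t *src = calloc(n, sizeof *src);
    struct pixel_t *a = calloc(n, sizeof *a);
    struct pixel_t *b = calloc(n, sizeof *b);
    int ok = (src != NULL && a != NULL && b != NULL);
    size_t k;

    if (ok) {
        /* Fill the source with values that differ from pixel to pixel. */
        for (k = 0; k < n; k++) {
            src[k].red = (unsigned short)(k & 0xFFFF);
            src[k].green = (unsigned short)((k * 3 + 1) & 0xFFFF);
            src[k].blue = (unsigned short)((k * 7 + 2) & 0xFFFF);
        }
        naive_rotate(dim, src, a);
        rowcopy_flip(dim, src, b);
        /* Compare field by field so the check does not depend on padding. */
        for (k = 0; k < n && ok; k++) {
            ok = a[k].red == b[k].red &&
                 a[k].green == b[k].green &&
                 a[k].blue == b[k].blue;
        }
    }
    free(src);
    free(a);
    free(b);
    return ok;
}
```

Calling flips_match(32), for example, compares the two versions on a 32 x 32 image. This only shows that the two forms are equivalent; whether the row-copy form changes the measured score would still need to be timed.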
My attempt: This does optimize it, but only a bit: the mean CPE goes up from 1.0 to 1.4. I'd like it to be around 2.5. I've tried various kinds of blocking and other techniques I've read about online, but haven't managed to improve it further.
for(i = 0; i < dim; i++){
    for(j = 0; j < dim; j+=16){
        dst[RIDX(dim-1-i, j, dim)] = src[RIDX(i, j, dim)];
        dst[RIDX(dim-1-i, j+1, dim)] = src[RIDX(i, j+1, dim)];
        dst[RIDX(dim-1-i, j+2, dim)] = src[RIDX(i, j+2, dim)];
        dst[RIDX(dim-1-i, j+3, dim)] = src[RIDX(i, j+3, dim)];
        dst[RIDX(dim-1-i, j+4, dim)] = src[RIDX(i, j+4, dim)];
        dst[RIDX(dim-1-i, j+5, dim)] = src[RIDX(i, j+5, dim)];
        dst[RIDX(dim-1-i, j+6, dim)] = src[RIDX(i, j+6, dim)];
        dst[RIDX(dim-1-i, j+7, dim)] = src[RIDX(i, j+7, dim)];
        dst[RIDX(dim-1-i, j+8, dim)] = src[RIDX(i, j+8, dim)];
        dst[RIDX(dim-1-i, j+9, dim)] = src[RIDX(i, j+9, dim)];
        dst[RIDX(dim-1-i, j+10, dim)] = src[RIDX(i, j+10, dim)];
        dst[RIDX(dim-1-i, j+11, dim)] = src[RIDX(i, j+11, dim)];
        dst[RIDX(dim-1-i, j+12, dim)] = src[RIDX(i, j+12, dim)];
        dst[RIDX(dim-1-i, j+13, dim)] = src[RIDX(i, j+13, dim)];
        dst[RIDX(dim-1-i, j+14, dim)] = src[RIDX(i, j+14, dim)];
        dst[RIDX(dim-1-i, j+15, dim)] = src[RIDX(i, j+15, dim)];