C picture rotation optimization

Question

This is for all you C experts out there..

The first function takes a two-dimensional matrix src[dim][dim] representing pixels of an image, and rotates it 90 degrees into a destination matrix dst[dim][dim]. The second function takes the same src[dim][dim] and smoothens the image by replacing every pixel value with the average of all the pixels around it (in a maximum of 3 × 3 window centered at that pixel).

I need to optimize the program in account for time and cycles, how else would I be able to optimize the following?:

void rotate(int dim, pixel *src, pixel *dst,) 
{
    int i, j, nj;
    nj = 0;

    /* below are the main computations for the implementation of rotate. */
    for (j = 0; j < dim; j++) {
        nj = dim-1-j;               /* Code Motion moved operation outside inner for loop */
        for (i = 0; i < dim; i++) {
            dst[RIDX(nj, i, dim)] = src[RIDX(i, j, dim)];
        }
    }
}



/* A struct used to compute averaged pixel value */
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

/* Compute min and max of two integers, respectively */
static int minimum(int a, int b) 
{ return (a < b ? a : b); }
static int maximum(int a, int b) 
{ return (a > b ? a : b); }

/* 
 * initialize_pixel_sum - Initializes all fields of sum to 0 
 */
static void initialize_pixel_sum(pixel_sum *sum) 
{
    sum->red = sum->green = sum->blue = 0;
    sum->num = 0;
    return;
}

/* 
 * accumulate_sum - Accumulates field values of p in corresponding 
 * fields of sum 
 */
static void accumulate_sum(pixel_sum *sum, pixel p) 
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
    return;
}

/* 
 * assign_sum_to_pixel - Computes averaged pixel value in current_pixel 
 */
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum) 
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
    return;
}

/* 
 * avg - Returns averaged pixel value at (i,j) 
 */
static pixel avg(int dim, int i, int j, pixel *src) 
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;

    initialize_pixel_sum(&sum);
    for(ii = maximum(i-1, 0); ii <= minimum(i+1, dim-1); ii++) 
        for(jj = maximum(j-1, 0); jj <= minimum(j+1, dim-1); jj++) 
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

void smooth(int dim, pixel *src, pixel *dst) 
{
    int i, j;


    /* below are the main computations for the implementation of the smooth function. */

    for (j = 0; j < dim; j++)
        for (i = 0; i < dim; i++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}

I moved dim-1-j outside of the inner for loop of rotate which reduces time and cycles used in the program, but is there anything else that can be used for either main function?

Thanks!

So, you are asking for help optimizing an otherwise functioning piece of code? — ryyker, Dec 06 '14 at 01:13
Yes is that not allowed on stackoverflow? I am stuck on what else CAN be optimized. Figured more experienced C programmers can skim through it and point out whats obvious — Sean, Dec 06 '14 at 01:14
Replace arithmetic with pointers and simple add operations. For example on the rotate setup a srcPtr and dstPtr. Calc pointers on each row of the source and then *dstPtr = *srcPtr++; dstPtr+= pitch. This avoids the multiply's etc for each pixel, etc — Chris, Dec 06 '14 at 01:15
Chris, isnt that something that the gcc compiler does automatically? — Sean, Dec 06 '14 at 01:19
Some compilers have optimization settings - size or speed. Check your documentation for any available optimization settings. Yes, certainly your question is allowed, but soft answers (such as this one) and opinion will be rampant, and precise answers few. By asking, I was just trying to determine if there was something more definite that could be addressed (a specific problem) — ryyker, Dec 06 '14 at 01:19
Sean, can you provide a sample input and expected output and the expected dimensions? I have some ideas but without actual code to test I can't help more. — Levi Morrison, Dec 06 '14 at 01:19
I cannot unfortunately provide any more code/files in a convenient manner on stack overflow:( I apologize for inconvenience, and simple pointers would help though. Input starts at dimensions 512 and outputs are time and cycles used. — Sean, Dec 06 '14 at 01:21

score 1 · Answer 1 · edited May 23 '17 at 11:50

There are several oprimizations you can do; some a compiler might do for you but best to write it out yourself. For example: moving constant expressions out of the loop (you did that once; there are more places you can do that - don't forget that the condition is checked every iteration too, so optimize the loop condition in this manner too) and, as Chris pointed out, use pointers that you increment instead of full array indexing. I also see some function calls that can be rewritten in-line.

I also want to point to an article on stackoverflow about matrix multiplication and optimizing that to use the processor cache. In essence it first rearranges the arrrays into memory bocks that fit the cache, then performs the operation on those blocks, then moves to the next block, and so on. You may be able to re-use the ideas for your rotation. See Optimizing assembly generated by Microsoft Visual Studio Compiler

score 1 · Answer 2 · 2014-12-06T11:12:37.663

For the rotation, you get a better utilization of the cache by decomposing in smaller image tiles.

For the smoothing,

1) expand the whole operation inside the main double loop, do not use these intermediate micro-functions;

2) completely unroll the accumulation and averaging (it's only a sum of 9 terms), hard coding the indexes;

3) process in different loops along the edges (where not all 9 pixels are available) and in the middle. The middle deserves maximum optimization (especially (2));

4) try and avoid the divisions by 9 (you can think of replacing the division by a table lookup).

Top speed will be obtained by handcrafting vectorized optimization (SSE/AVX), but this requires some deal of experience. Multicore parallelization is also an option.

To give you an idea, it is possible to apply a 3x3 average on a 1 MB grayscale image in less than 0.5 ms (monocore, Core i7@3.4 GHz). We can extrapolate to 2 ms or so for a 1 Mpixel RGB image.

Levi Morrison · Answer 3 · 2014-12-06T16:05:32.227

Since you can't provide a running program these are just ideas of things that could help:

Assuming values in the range [0,256) then use uint8_t as your rgbn values. This takes up 1/4 of the memory of the int version but will likely require more cycles; I can't know if this would be faster or not without more knowledge. The idea is that since you use 1/4 of the memory you are more likely to keep more values in L1-L3 cache.
Since your neighbors are the same whether you are rotated or not, calculate the average before rotating. I suspect this would help out with caching but again can't be sure; it depends on some code I can't see.
Parallelize the outer loop. Since you have easy grid dimensions and the inputs and outputs don't have read/write conflicts this is a trivial thing to do. This will certainly take more cycles but will possibly be faster.
Hard-code your edges; you are currently doing maximum and minimum operations on every call to average, but for the inner points it is unneeded. Calculate the edges and the inner points separately.

C picture rotation optimization

3 Answers3