How to make the following code of bilinear interpolation more efficient?

Question

The following code is to enlarge pictures using bilinear interpolation.

Where can be modified in the function of slow_rescale to make it more efficient?

I expect to modify it from the view of Principles of Computer Organization.

Looking forward to your answers!

Thanks!

unsigned char *slow_rescale(unsigned char *src, int src_x, int src_y, int dest_x, int dest_y)
{
 double step_x,step_y;          // Step increase as per instructions above
 unsigned char R1,R2,R3,R4;     // Colours at the four neighbours
 unsigned char G1,G2,G3,G4;
 unsigned char B1,B2,B3,B4;
 double RT1, GT1, BT1;          // Interpolated colours at T1 and T2
 double RT2, GT2, BT2;
 unsigned char R,G,B;           // Final colour at a destination pixel
 unsigned char *dst;            // Destination image - must be allocated here! 
 int x,y;               // Coordinates on destination image
 double fx,fy;              // Corresponding coordinates on source image
 double dx,dy;              // Fractional component of source image    coordinates

 dst=(unsigned char *)calloc(dest_x*dest_y*3,sizeof(unsigned char));   // Allocate and clear   destination image
 if (!dst) return(NULL);                           // Unable to allocate image

 step_x=(double)(src_x-1)/(double)(dest_x-1);
 step_y=(double)(src_y-1)/(double)(dest_y-1);

 for (x=0;x<dest_x;x++)         // Loop over destination image
  for (y=0;y<dest_y;y++)
  {
    fx=x*step_x;
    fy=y*step_y;
    dx=fx-(int)fx;
    dy=fy-(int)fy;   
    getPixel(src,floor(fx),floor(fy),src_x,&R1,&G1,&B1);    // get N1 colours
    getPixel(src,ceil(fx),floor(fy),src_x,&R2,&G2,&B2); // get N2 colours
    getPixel(src,floor(fx),ceil(fy),src_x,&R3,&G3,&B3); // get N3 colours
    getPixel(src,ceil(fx),ceil(fy),src_x,&R4,&G4,&B4);  // get N4 colours
   // Interpolate to get T1 and T2 colours
   RT1=(dx*R2)+(1-dx)*R1;
   GT1=(dx*G2)+(1-dx)*G1;
   BT1=(dx*B2)+(1-dx)*B1;
   RT2=(dx*R4)+(1-dx)*R3;
   GT2=(dx*G4)+(1-dx)*G3;
   BT2=(dx*B4)+(1-dx)*B3;
   // Obtain final colour by interpolating between T1 and T2
   R=(unsigned char)((dy*RT2)+((1-dy)*RT1));
   G=(unsigned char)((dy*GT2)+((1-dy)*GT1));
   B=(unsigned char)((dy*BT2)+((1-dy)*BT1));
  // Store the final colour
  setPixel(dst,x,y,dest_x,R,G,B);
 }
  return(dst);
}
void getPixel(unsigned char *image, int x, int y, int sx, unsigned char *R, unsigned char *G, unsigned char *B)
{
 // Get the colour at pixel x,y in the image and return it using the provided RGB pointers
 // Requires the image size along the x direction!
 *(R)=*(image+((x+(y*sx))*3)+0);
 *(G)=*(image+((x+(y*sx))*3)+1);
 *(B)=*(image+((x+(y*sx))*3)+2);
}

void setPixel(unsigned char *image, int x, int y, int sx, unsigned char R, unsigned char G, unsigned char B)
{
 // Set the colour of the pixel at x,y in the image to the specified R,G,B
 // Requires the image size along the x direction!
 *(image+((x+(y*sx))*3)+0)=R;
 *(image+((x+(y*sx))*3)+1)=G;
 *(image+((x+(y*sx))*3)+2)=B;
}

I have just edited it and added the functions of getPixel() and setPixel().@self. — user2964454, Jan 01 '14 at 16:29
Well apart from completely changing the algorithm, you could remove some redundant multiplications. At getPixel: `int x, int y, int sx` pass `((x+(y*sx))*3)+0` to the function instead. — this, Jan 01 '14 at 16:33
You may want to ask this on [CodeReview](http://codereview.stackexchange.com) it would be more appropriate there. — Mark Hall, Jan 01 '14 at 16:38
`-O3`? maybe profiling? then, though some of these can be wasteful/ineffective because of compiler optimization: inline get/setPixel; change them so that one 32bit read/write from/to mem is done (beware of endianness), possibly 4/8-byte aligned (can you have "R G B 0 R G B 0" instead of "R G B R G B"?); some more temporary vars (e.g. for floor(fx)) … — ShinTakezou, Jan 01 '14 at 16:58
btw, tried some suggested "manual" optimizations, without compiler optim, … it turns out that `-O3` outperforms these "manual" optimizations. — ShinTakezou, Jan 01 '14 at 17:37

score 3 · Answer 1 · edited May 23 '17 at 12:07

I worry about image-processing performance all the time. Below are a few obvious considerations to keep in mind:

Numerical Precision:

The first thing that jumps out at me from your code is the use of doubles for step size, color values, and coordinates. Do you really need that level of precision for these quantities? If not, you might do some profiling to check the performance of your code when using fixed point or floats instead.

Keep in mind that this is a hardware dependent question and the performance may or may not be an issue depending on whether or not your hardware implements double, float only, or neither (then both are implemented in software). Discussions on this aspect also include memory alignment, coalesced memory access, etc. Certainly these topics touch on "Principles of Computer Organization," there is more discussion on this topic is here.

Loop Unrolling:

Have you also considered manual loop unrolling? This may or may not help since your compiler may already try to take advantage of such optimizations, but its at least worth thinking about since you have a double-loop over potentially large array sizes.

Numerical Redundancies:

Inside your getPixel() function, you also calculate image+((x+(y*sx))*3, for each RGB component, and this doesn't appear to change, why not just calculate this quantity once at the start of your function?

Vector Processing:

Its hard to think about optimizing such a code without first wondering whether or not you can take advantage of vector processing. Do you have access to vectorized instructions sets, e.g., SSE?

Parallel Processing:

Most systems have OpenMP installed. If so, You might consider restructuring your code to take advantage of your processor's multi-core capabilities. This is surprising straightforward to implement using pragma's, its certainly worth checking out.

Compiler Flags:

Also, although you didn't mention it directly, compilation flags affect the performance of C-code. E.g., if using gcc you might compare performance differences using:

gcc -std=c99 -o main main.c

vs.

gcc -std=c99 -O3 -o main main.c

score 2 · Answer 2 · answered Jan 01 '14 at 16:58

2

Here are some ideas:

Use fixed-point arithmetic instead of floating-point. This will make calculations like floor and ceil (and possibly multiplications, though I am not sure about that) faster.
Replace ceil(x) by floor(x)+1
Use strength reduction to replace multiplication in fx=x*step_x by addition
If you know the layout of pixels in memory, replace getPixel by something more efficient
Reduce two multiplications to one using the following code transformation: (dx*R2)+(1-dx)*R1 ==> R1+dx*(R2-R1)
Unroll the inner loop
(Last, but maybe has most potential) Use a vectorizing compiler or edit your code manually to use SSE or other SIMD technique (if available on your platform)

answered Jan 01 '14 at 16:58

anatolyg

26,506
9
60
134

`ceil(x)` is not equal to `floor(x) + 1` if `x` is already a whole number. – Alnitak Jan 01 '14 at 17:45
ah, I see - the OP incorrectly used `ceil(x)` himself. I thought you were suggesting that the two were equivalent rather than correcting the OP's misuse of the former. – Alnitak Jan 01 '14 at 17:47
@Alnitak I see nothing wrong in OP's usage of `ceil`. Also, if the argument of `ceil` is an integer, then `dx=0`, and the calculation gets multiplied by 0, so it's OK for it to be incorrect (however, one might need to avoid reading from bad addresses, which might result in page faults and what not). – anatolyg Jan 01 '14 at 18:42
Yes, as it happens it degenerates to the correct answer, but I think that's more likely to be by luck rather than judgement. – Alnitak Jan 01 '14 at 18:49

LiMar · Answer 3 · 2014-01-02T09:04:55.067

Multiplication operation can be largely reduced in this code.

dx can be calculated in the outer loop, and there we can prepare multiplication table for further operations like RT1=(dx*R2)+(1-dx)*R1 because multiplication(R2,R1,etc) is of 1 byte size.

The following code runs ~10 times faster then the original on my machine (Mac OS, Mac C++ compiler with -O3):

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

inline void fast_getPixel(unsigned char *image, int x, int y, int sx, unsigned char *R, unsigned char *G, unsigned char *B)
{
    // Get the colour at pixel x,y in the image and return it using the provided RGB pointers
    // Requires the image size along the x direction!
    unsigned char *ptr = image+((x+(y*sx))*3);
    *R=ptr[0];
    *G=ptr[1];
    *B=ptr[2];
}

inline void fast_setPixel(unsigned char *image, int x, int y, int sx, unsigned char R, unsigned char G, unsigned char B)
{
    // Set the colour of the pixel at x,y in the image to the specified R,G,B
    // Requires the image size along the x direction!
    unsigned char *ptr = image+((x+(y*sx))*3);
    ptr[0]=R;
    ptr[1]=G;
    ptr[2]=B;
}

void build_dx_table(double* table,double dx)
{
    unsigned len = 0xff;
    table[0] = 0;
    for (unsigned i=1;i<len;i++)
    {
        table[i] = table[i-1]+dx;
    }
}

unsigned char *fast_rescale(unsigned char *src, int src_x, int src_y, int dest_x, int dest_y)
{
    double step_x,step_y;          // Step increase as per instructions above
    unsigned char R1,R2,R3,R4;     // Colours at the four neighbours
    unsigned char G1,G2,G3,G4;
    unsigned char B1,B2,B3,B4;
    double RT1, GT1, BT1;          // Interpolated colours at T1 and T2
    double RT2, GT2, BT2;
    unsigned char R,G,B;           // Final colour at a destination pixel
    unsigned char *dst;            // Destination image - must be allocated here!
    int x,y;               // Coordinates on destination image
    double fx,fy;              // Corresponding coordinates on source image
    double dx,dy;              // Fractional component of source image    coordinates
    double dxtable[0xff];

    dst=(unsigned char *)calloc(dest_x*dest_y*3,sizeof(unsigned char));   // Allocate and clear   destination image
    if (!dst) return(NULL);                           // Unable to allocate image

    step_x=(double)(src_x-1)/(double)(dest_x-1);
    step_y=(double)(src_y-1)/(double)(dest_y-1);

    for (x=0,fx=0;x<dest_x;x++,fx+=step_x)         // Loop over destination image
        dx=fx-(int)fx;
        build_dx_table(dxtable,dx);
        for (y=0,fy=0;y<dest_y;y++,fy+=step_y)
        {
            dy=fy-(int)fy;
            fast_getPixel(src,floor(fx),floor(fy),src_x,&R1,&G1,&B1);    // get N1 colours
            fast_getPixel(src,ceil(fx),floor(fy),src_x,&R2,&G2,&B2); // get N2 colours
            fast_getPixel(src,floor(fx),ceil(fy),src_x,&R3,&G3,&B3); // get N3 colours
            fast_getPixel(src,ceil(fx),ceil(fy),src_x,&R4,&G4,&B4);  // get N4 colours
            // Interpolate to get T1 and T2 colours
            RT1=dxtable[R2-R1]+R1;
            GT1=dxtable[G2-G1]+G1;
            BT1=dxtable[B2-B1]+B1;
            RT2=dxtable[R4-R3]+R3;
            GT2=dxtable[G4-G3]+G3;
            BT2=dxtable[B4-B3]+B3;
            // Obtain final colour by interpolating between T1 and T2
            R=(unsigned char)(dy*(RT2-RT1)+RT1);
            G=(unsigned char)(dy*(GT2-GT1)+GT1);
            B=(unsigned char)(dy*(BT2-BT1)+BT1);
            // Store the final colour
            fast_setPixel(dst,x,y,dest_x,R,G,B);
        }
    return(dst);
}

When I run your code in my machine,it gives a debug error with message "DAMAGE:after normal block(#161) at 0x0092B040".I think the problem is connected with the array dxtable[],but I don't know why.Could you help me with this question? @LiMar — user2964454, Jan 03 '14 at 05:05

Z boson · Answer 4 · 2014-01-02T20:18:55.067

GPUs have hardware to do bilinear interpolation for you. Doing this on the CPU is like doing floating point operations in software without using the floating point hardware (e.g. x87 or SSE/AVX). My best advice is to consider optimizing algorithms such as bicubic interpolation or general image filters which may give better visual results and which are not supported on most GPUs. Graphic Gems III, even though it is ancient, has a good section on "General Filtered Image Rescaling" both for maginfication and minification.

However, if you still want to do bilinear interpolation on the CPU you should consider hardware speedups on the CPU. In that case I would look at using SIMD. See this link bilinear-pixel-interpolation-using-sse which shows how to do bi-linear interpolation using SSE. I have tested this code and the SSE code is much faster. You could combine that with OpenMP to use multiple threads on large images.

I also tested the fixed point code and found that it gave better results than the non-SSE code with MSVC2010 but not in MSVC2012. I expect that for most modern compilers the fixed point code won't be any better unless it's run on a embedded system without floating point hardware.