0

this is a follow up to this question, where I was trying to implement a 3D direct convolution with periodic boundary condition.

With the help of the community, I was able to increase the performance by ~50% (big thanks to the community):

int mod(int a, int b)
{
    if (a<0)
        return a + b;
    else if (a >= b)
        return a - b;
    else
        return a;
}

void convolve(const double *image, const double *kernel, const int imageDimX, const int imageDimY, 
const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;

    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift,yShift,zShift;
    int outIndex, outI, outJ, outK;
    int outputIndex = 0, imageIndex = 0, kernelIndex = 0;
    double currentPixelValue;

    for (k = 0; k < imageDimZ; k++){
        for ( j = 0; j < imageDimY; j++) {
            for ( i = 0; i < imageDimX; i++) {
                currentPixelValue = 0.0;

                kernelIndex = 0;
                for (n = 0, zShift = - kernelCenterZ; n < kernelDimZ; n++, zShift++){
                    outK = mod ((k - zShift), imageDimZ);
                    for ( m = 0, yShift = - kernelCenterY; m < kernelDimY; m++, yShift++) {
                        outJ = mod ((j - yShift), imageDimY);
                        for ( l = 0, xShift = - kernelCenterX; l < kernelDimX; l++, xShift++) {
                            outI = mod ((i - xShift), imageDimX);
                            
                            imageIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            
                            // This mysterious line
                            currentPixelValue += kernel[kernelIndex]* image[imageIndex]; 

                            kernelIndex++;
                        }
                    }
                }

                result[outputIndex] = currentPixelValue;
                outputIndex ++;
            }
        }
    } 
}

as a reference, for a test case I have, this takes ~5.65s to run.

just out of curiosity, I was trying to pin point exactly which operation is the bottle neck, and it turns out to be the mysterious line in the inner-most loop.

by deleting that line it takes 0.66s to run.

So I thought maybe it's the array accessing that's taking too long, so I changed that line to

currentPixelValue += 1.0;

but the run time performance was only improved to 5.185s, which was definitely unexpected to me,

so I tried to change that += to = just as test:

currentPixelValue = 1.0;

the performance was improved vastly to 0.853s

so clearly the bottleneck was the += operation, which is again very interesting to me.

How is merely accessing the value of a variable and add a constant to it bottleneck the algorithm? Could you guys help me with some insights, and hopefully further improve the performance?

EDIT:

as another comparison case, I tried to change the line to

currentPixelValue = stencil[stencilIndex]* image[imageIndex];

and it took ~5.15s to run

I'm struggling to make sense out of this, I think the tests suggest that any kind of value access bottlenecks the algorithm. However, the line directly above it, also in the innermost loop, also has value accessing, doesn't seem to bottleneck it...

This is very mysterious and interesting to me lol

compilation info:

CC=mpicc
CFLAGS = -O3 -Wall -g -std=gnu99
lxiangyun93
  • 77
  • 1
  • 7
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/218722/discussion-on-question-by-lxiangyun93-why-is-in-c-taking-so-much-time-and-ho). – Samuel Liew Jul 28 '20 at 01:25

1 Answers1

5

By replacing currentPixelValue += ... with currentPixelValue = 1 you are making most of the rest of the function unnecessary, thus allowing the compiler to optimize it out.

One would have to read the generated assembly to know for sure, but we can reason through it like this:

  • Suppose you replace currentPixelValue += ... with currentPixelValue = 1.

  • Now the value of imageIndex computed on the previous line is never used. So the compiler deletes that computation.

  • Now outI, outJ, outK are no longer used either, and since the compiler can inline the mod function it knows it has no side effects, so those computations can be optimized out as well.

  • kernelIndex is now never used either, so get rid of it too.

Your i loop now looks like this:

                currentPixelValue = 0.0;

                kernelIndex = 0;
                for (n = 0, zShift = - kernelCenterZ; n < kernelDimZ; n++, zShift++){
                    for ( m = 0, yShift = - kernelCenterY; m < kernelDimY; m++, yShift++) {
                        for ( l = 0, xShift = - kernelCenterX; l < kernelDimX; l++, xShift++) {
                            currentPixelValue = 1;
                        }
                    }
                }

                result[outputIndex] = currentPixelValue;
                outputIndex ++;

In other words, the n,m,l loops do nothing except setting currentPixelValue to 1 over and over again. The compiler knows that is pointless, so it just has to do it once. This also makes the initialization currentPixelValue = 0.0 unnecessary, and kernelIndex is not used at all. We are thus left with:

    for (k = 0; k < imageDimZ; k++){
        for ( j = 0; j < imageDimY; j++) {
            for ( i = 0; i < imageDimX; i++) {
                currentPixelValue = 1;
                result[outputIndex] = currentPixelValue;
                outputIndex ++;
            }
        }
    } 

That is, the function now doesn't do anything except fill the result matrix with 1s, and the compiler can do that in a very efficient way.

So the difference between 5.15 sec and 0.853 sec represents not just the addition, but practically all the interesting computation your function does.

(If you instead change the line to currentPixelValue += 1 then the compiler does still have to run all of the n,m,l loops to increment it the right number of times. If it were smart enough it could replace the loops by

currentPixelValue = 8 * kernelCenterX * kernelCenterY * kernelCenterZ;

but it may not be that smart.

Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
  • Ah, I think this explains all the tests, Thanks! So I guess unfortunately there is no way to optimize it with out SIMD/parallelization... – lxiangyun93 Jul 27 '20 at 22:28