this is a follow up to this question, where I was trying to implement a 3D direct convolution with periodic boundary condition.
With the help of the community, I was able to increase the performance by ~50% (big thanks to the community):
int mod(int a, int b)
{
if (a<0)
return a + b;
else if (a >= b)
return a - b;
else
return a;
}
void convolve(const double *image, const double *kernel, const int imageDimX, const int imageDimY,
const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
int imageSize = imageDimX * imageDimY * imageDimZ;
int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
int i, j, k, l, m, n;
int kernelCenterX = (kernelDimX - 1) / 2;
int kernelCenterY = (kernelDimY - 1) / 2;
int kernelCenterZ = (kernelDimZ - 1) / 2;
int xShift,yShift,zShift;
int outIndex, outI, outJ, outK;
int outputIndex = 0, imageIndex = 0, kernelIndex = 0;
double currentPixelValue;
for (k = 0; k < imageDimZ; k++){
for ( j = 0; j < imageDimY; j++) {
for ( i = 0; i < imageDimX; i++) {
currentPixelValue = 0.0;
kernelIndex = 0;
for (n = 0, zShift = - kernelCenterZ; n < kernelDimZ; n++, zShift++){
outK = mod ((k - zShift), imageDimZ);
for ( m = 0, yShift = - kernelCenterY; m < kernelDimY; m++, yShift++) {
outJ = mod ((j - yShift), imageDimY);
for ( l = 0, xShift = - kernelCenterX; l < kernelDimX; l++, xShift++) {
outI = mod ((i - xShift), imageDimX);
imageIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
// This mysterious line
currentPixelValue += kernel[kernelIndex]* image[imageIndex];
kernelIndex++;
}
}
}
result[outputIndex] = currentPixelValue;
outputIndex ++;
}
}
}
}
as a reference, for a test case I have, this takes ~5.65s to run.
just out of curiosity, I was trying to pin point exactly which operation is the bottle neck, and it turns out to be the mysterious line in the inner-most loop.
by deleting that line it takes 0.66s to run.
So I thought maybe it's the array accessing that's taking too long, so I changed that line to
currentPixelValue += 1.0;
but the run time performance was only improved to 5.185s, which was definitely unexpected to me,
so I tried to change that += to = just as test:
currentPixelValue = 1.0;
the performance was improved vastly to 0.853s
so clearly the bottleneck was the += operation, which is again very interesting to me.
How is merely accessing the value of a variable and add a constant to it bottleneck the algorithm? Could you guys help me with some insights, and hopefully further improve the performance?
EDIT:
as another comparison case, I tried to change the line to
currentPixelValue = stencil[stencilIndex]* image[imageIndex];
and it took ~5.15s to run
I'm struggling to make sense out of this, I think the tests suggest that any kind of value access bottlenecks the algorithm. However, the line directly above it, also in the innermost loop, also has value accessing, doesn't seem to bottleneck it...
This is very mysterious and interesting to me lol
compilation info:
CC=mpicc
CFLAGS = -O3 -Wall -g -std=gnu99