I need to write a program that will read an image into an array, perform a convolution operation (sharpening) and save it back to a file (ppm). I wrote a standard algorithm:
unsigned char* imgBefore = malloc(height*(3*width)*sizeof(unsigned char));
assert(fread(imgBefore, 3*width, height, inputFile) == height);
unsigned char* imgAfter = malloc(height*(3*width)*sizeof(unsigned char));
short ker[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};
unsigned char r, g, b;
for(int y = 0; y < height; y++) {
for(int x = 0; x < width; x++) {
if(y == 0 || y == height-1 || x == 0 || x == width-1) {
r = imgBefore[3*(width*y + x) + 0];
g = imgBefore[3*(width*y + x) + 1];
b = imgBefore[3*(width*y + x) + 2];
imgAfter[3*(width*y + x) + 0] = r;
imgAfter[3*(width*y + x) + 1] = g;
imgAfter[3*(width*y + x) + 2] = b;
continue;
}
int rSum = 0, gSum = 0, bSum = 0, val;
for(int dy = 0; dy < 3; dy++) { // kernels height
int yy = 3*width*(y+dy-1);
for(int dx = 0; dx < 3; dx++) { // kerenels width
int xx = 3*(x+dx-1);
val = ker[dy][dx];
rSum += val * imgBefore[yy + xx + 0];
gSum += val * imgBefore[yy + xx + 1];
bSum += val * imgBefore[yy + xx + 2];
}
}
rSum = rSum < 0 ? 0 : (rSum > 255 ? 255 : rSum);
gSum = gSum < 0 ? 0 : (gSum > 255 ? 255 : gSum);
bSum = bSum < 0 ? 0 : (bSum > 255 ? 255 : bSum);
imgAfter[3*(width*y + x) + 0] = rSum;
imgAfter[3*(width*y + x) + 1] = gSum;
imgAfter[3*(width*y + x) + 2] = bSum;
}
}
fwrite(imgAfter, 3*width, height, outputFile);
Next, I need to optimize its effectiveness in interacting with the cache. It seems to me that the problem part is in this piece of code:
for(int dy = 0; dy < 3; dy++) { // kernels height
int yy = 3*width*(y+dy-1);
for(int dx = 0; dx < 3; dx++) { // kerenels width
int xx = 3*(x+dx-1);
val = ker[dy][dx];
rSum += val * imgBefore[yy + xx + 0];
gSum += val * imgBefore[yy + xx + 1];
bSum += val * imgBefore[yy + xx + 2];
}
}
because it first loads one row of the matrix, uses only 3 (9) elements and then goes to the next row. This seems completely ineffective in relation to the cache.
What can I do to fix this?
I also tried to reuse individual pixels or entire rows. All this only worsened the result (there is a great chance that I simply poorly implemented structures like FIFO or added and read from them in the wrong places). If a program needs it, how should it look? For evaluation I use: valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=1048576,16,64 ./a.out
I would be grateful for any advice