Optimization of the algorithm when working with images in C (image convolution)

Question

I need to write a program that will read an image into an array, perform a convolution operation (sharpening) and save it back to a file (ppm). I wrote a standard algorithm:

    unsigned char* imgBefore = malloc(height*(3*width)*sizeof(unsigned char));
    assert(fread(imgBefore, 3*width, height, inputFile) == height);
    unsigned char* imgAfter = malloc(height*(3*width)*sizeof(unsigned char));

    short ker[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};

    unsigned char r, g, b;

    for(int y = 0; y < height; y++) {
        for(int x = 0; x < width; x++) {
            if(y == 0 || y == height-1 || x == 0 || x == width-1) {
                r = imgBefore[3*(width*y + x) + 0];
                g = imgBefore[3*(width*y + x) + 1];
                b = imgBefore[3*(width*y + x) + 2];
                imgAfter[3*(width*y + x) + 0]  = r;
                imgAfter[3*(width*y + x) + 1]  = g;
                imgAfter[3*(width*y + x) + 2]  = b;
                continue;
            }

            int rSum = 0, gSum = 0, bSum = 0, val;

            for(int dy = 0; dy < 3; dy++) { // kernels height
                int yy = 3*width*(y+dy-1);
                for(int dx = 0; dx < 3; dx++) { // kerenels width
                    int xx = 3*(x+dx-1);
                    val = ker[dy][dx];

                    rSum += val * imgBefore[yy + xx + 0];
                    gSum += val * imgBefore[yy + xx + 1];
                    bSum += val * imgBefore[yy + xx + 2];
                }
            }
            rSum = rSum < 0 ? 0 : (rSum > 255 ? 255 : rSum);
            gSum = gSum < 0 ? 0 : (gSum > 255 ? 255 : gSum);
            bSum = bSum < 0 ? 0 : (bSum > 255 ? 255 : bSum);

            imgAfter[3*(width*y + x) + 0] = rSum;
            imgAfter[3*(width*y + x) + 1] = gSum;
            imgAfter[3*(width*y + x) + 2] = bSum;
        }
    }

    fwrite(imgAfter, 3*width, height, outputFile);

Next, I need to optimize its effectiveness in interacting with the cache. It seems to me that the problem part is in this piece of code:

for(int dy = 0; dy < 3; dy++) { // kernels height
    int yy = 3*width*(y+dy-1);
    for(int dx = 0; dx < 3; dx++) { // kerenels width
        int xx = 3*(x+dx-1);
        val = ker[dy][dx];

        rSum += val * imgBefore[yy + xx + 0];
        gSum += val * imgBefore[yy + xx + 1];
        bSum += val * imgBefore[yy + xx + 2];
    }
}

because it first loads one row of the matrix, uses only 3 (9) elements and then goes to the next row. This seems completely ineffective in relation to the cache.

What can I do to fix this?

I also tried to reuse individual pixels or entire rows. All this only worsened the result (there is a great chance that I simply poorly implemented structures like FIFO or added and read from them in the wrong places). If a program needs it, how should it look? For evaluation I use: valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=1048576,16,64 ./a.out

I would be grateful for any advice

You make the assumtion that the code is bounded by the memory hierarchy but I think this is not be the most critical point: your code is neither vectorized by GCC nor Clang (it seems also not to be parallelized). The problem comes from the data layout which is not SIMD-friendly. Thus, I expect the code to be bounded by the computations. Did you try to profile your program (except with valgrind)? What is the typical size of the image you use to make the tests? — Jérôme Richard, Apr 29 '20 at 17:03
The essence of the job is to optimize cache access. One of the first things I thought about was parallelization, but I realized that it can be added using the compiler options, which are fixed in my case. Or am I mistaken? In the compilation options, the only thing I don’t know and could not find the information is the -mssse3 option. Can this help somehow? Image resolution 4608x3456 — Ivan Chernikov, Apr 29 '20 at 17:31
Ok. `-mssse3` is an option to enable the SSSE3 instruction set enabling vectorization. It should not help here because compilers will likely fail to *automatically* vectorize this code. For the parallelization, you can use OpenMP to do that very easily. A simple `#pragma omp parallel for` directive before the `y` loop should give you a good speed up (do not forget to enable OpenMP using a compiler option, like `-fopenmp` on GCC/Clang). — Jérôme Richard, May 01 '20 at 18:34

score 0 · Accepted Answer · answered May 01 '20 at 18:04

The way you access data is not cache-friendly with big images. This can be improved (for example using tiling).

Be aware that the computation time of this program is not bounded by the memory hierarchy, but by a slow sequential scalar-based computation computation caused by an inefficient data layout.

Here is the Valgrind analysis result on 10 stencil function calls using a 4608 x 3456 x 3 image:

--171871-- warning: L3 cache found, using its data for the LL simulation.
--171871-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
--171871-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
==171871== 
==171871== I   refs:      30,437,416,378
==171871== I1  misses:               944
==171871== LLi misses:               938
==171871== I1  miss rate:           0.00%
==171871== LLi miss rate:           0.00%
==171871== 
==171871== D   refs:       7,526,412,742  (7,000,797,573 rd   + 525,615,169 wr)
==171871== D1  misses:        30,599,898  (   22,387,768 rd   +   8,212,130 wr)
==171871== LLd misses:        15,679,220  (    7,467,138 rd   +   8,212,082 wr)
==171871== D1  miss rate:            0.4% (          0.3%     +         1.6%  )
==171871== LLd miss rate:            0.2% (          0.1%     +         1.6%  )
==171871== 
==171871== LL refs:           30,600,842  (   22,388,712 rd   +   8,212,130 wr)
==171871== LL misses:         15,680,158  (    7,468,076 rd   +   8,212,082 wr)
==171871== LL miss rate:             0.0% (          0.0%     +         1.6%  )

First of all, we can see that there is a lot of D1 cache reference (47 per written pixel, due to scalar accesses).

The number of D1 misses and the D1 miss rate seems very small so we could think the cache access are good. But it not the case. Valgrind results are very misleading because D1 cache reference works at the 1-byte scalar granularity while D1 misses works at the 64-byte cache line granularity.

If we look the analysis more carefully, we can see that almost all written data cause D1 cache misses. Moreover, 20% of the data read from L1 need to be (re)loaded from the LLC (for the 7 GB read from the D1 cache, 1.43 GB are loaded from the LLC).

The problem comes from the large lines of the image. Indeed, 1 image line takes 4608 * 3 = 13824 bytes and 3 lines are needed to write one line. Thus, 41472 bytes are read from the D1 cache to write 13824 bytes. Theoretically, with a big enough D1 cache, 27648 should be reused from the D1 cache. However, the D1 cache is too small and cannot fit all read/written data (roughly because 41472+13824 > 32768).

We can improve the situation using two methods:

Tiling: the idea is to set the x loop bound at a small fixed size (eg. 1024) so that it will compute verticals blocks of the image, and then add a new encompassing xBlock loop (that includes the y and x loops) which iterate over the vertical blocks. This approach has the benefit of improving the D1 cache reuse, but may still cause some unnecessary cache misses on modern processors (since loaded lines are not contiguous anymore). You can mitigate this by performing a tiling on the y dimension or by adapting the data layout. Here is an example:

const int blockSize = 1024;
for(int xBlock = 0; xBlock < width; xBlock+=blockSize) {
    for(int y = 0; y < height; y++) {
        int xMax = xBlock + blockSize;
        if(xMax > width)
            xMax = width;
        for(int x = xBlock; x < xMax; x++) {
            // Unchanged part of the code
        }
    }
}

Non-temporal stores: writing data to the cache hierarchy cause cache pollution leading to additional cache misses (as written data require the eviction of previously read cache lines). It may also cause more loads on some architectures (eg. like on Intel processors). We can tell to the processor to write data directly into the memory. Some compilers provide #pragma directives to do that (doing that manually is a bit tricky in your code). This is good only if the image is too big to fit in the LLC cache or not reused later. See here for more information.

Here is the Valgrind result after applying the tiling method:

==183432== D   refs:       7,527,450,132  (7,001,731,093 rd   + 525,719,039 wr)
==183432== D1  misses:        16,025,288  (    7,640,368 rd   +   8,384,920 wr)
==183432== LLd misses:        16,024,790  (    7,639,918 rd   +   8,384,872 wr)

We can see that there are now 2 times less D1 cache misses (and 3 times less reads).

Optimization of the algorithm when working with images in C (image convolution)

1 Answers1