
For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:

int mod(int a, int b)
{
    // wrap the index into [0, b) for periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex;

    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++) {
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++) {
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;

                            outI = mod(i - xShift, imageDimX);
                            outJ = mod(j - yShift, imageDimY);
                            outK = mod(k - zShift, imageDimZ);

                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;

                            // scatter this voxel's contribution into the output
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
  • By convention, all the arrays (image, kernel, result) are stored in column-major order; that's why I loop through them in this order, so consecutive accesses are close in memory (I heard this helps).

I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good. Instead it's a little disappointing: I tested it with an image of size 100^3 and a kernel of size 10^3 (~2 GFLOP in total, counting one multiplication and one addition per kernel coefficient per voxel), and it took ~7 s, which I believe is way below what a typical CPU can do.

If possible, could you help me optimize this routine? I'm open to anything that could help, with just a few things to consider:

  1. The problem I'm working with can be big (e.g. an image of 200 by 200 by 200 with a kernel of 50 by 50 by 50, or even larger). I understand that one way to optimize this is to convert the problem into a matrix multiplication and use a BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.

  2. Due to the nature of the problem, I would prefer direct convolution over FFT-based convolution, since my model was developed with direct convolution in mind. My impression is that FFT convolution gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid. That said, I'm in no way an expert, so if you have a great FFT-based implementation and/or my impression of FFT convolution is biased, I would really appreciate your help.

  3. The input images are assumed to be periodic, so periodic padding is necessary

  4. I understand that using BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start. I would really appreciate it if you could point me in the right direction, if you have experience with these libraries.

Thanks a lot for your help, and please let me know if you need more info about the nature of the problem

lxiangyun93
  • This probably belongs on the Code Review exchange. – Christian Gibbons Jun 26 '20 at 21:32
  • I’m voting to close this question because it is beyond the scope of this site. – Sᴀᴍ Onᴇᴌᴀ Jun 26 '20 at 21:33
  • @ChristianGibbons Thanks for the suggestion, I'll post it there. – lxiangyun93 Jun 26 '20 at 21:39
  • @SᴀᴍOnᴇᴌᴀ I see, are there other sites/communities where this kind of question is more appropriate? – lxiangyun93 Jun 26 '20 at 21:39
  • 1
    It sounds like you might already be checking out CR... See ["Which Site?"](https://meta.stackexchange.com/a/129632/341145) and ["Code Review or not?"](https://codereview.meta.stackexchange.com/a/5778/120114) – Sᴀᴍ Onᴇᴌᴀ Jun 26 '20 at 21:41
  • 1
    Initial thought - the data seems very independent per loop or at least per image. Might be worth spinning up a few threads and dividing the work across cores. Outside that, I'd also double check output to see if you are getting and SIMD help at all on your system. Build flags to -O3 and such. Finally, be sure to profile this so you know where your slow points are. Seems like it would be GPU friendly as well. – Michael Dorgan Jun 26 '20 at 21:44
  • If you know `xShift` will be less than `imageDimX`, you can handle wrapping with one conditional subtract or add, instead of maybe compiling to an actual division. Division is very slow so you really don't want to be doing it in your inner-most loop. – Peter Cordes Jun 26 '20 at 21:54
  • 1
    @PeterCordes Thanks for the idea! I tried to use conditional subtract/add instead of % in the mod function and there is indeed a ~20% speed up. Thanks! – lxiangyun93 Jun 26 '20 at 22:31
  • @MichaelDorgan Thanks for the suggestions! Indeed this operation is highly parallelizable; how could I parallelize this across multiple threads? Sorry, I'm quite new to C... Also, how can I check whether I'm using SIMD? I did compile this with the -O3 flag, by the way. – lxiangyun93 Jun 26 '20 at 22:34
  • 1
    You can look at the assembly output (-S usually) and make sure you see SIMD instructions appearing. Is this for X86, ARM, or something else? For the threading ID, if you haven't done it before, it is non trivial in C to setup the threads. Easiest way might be to run your app with forks or many at the same time with different files so that the OS handles it for you. – Michael Dorgan Jun 26 '20 at 22:43
  • FFT-based convolution will bring down the complexity from `O(n³k³)` to something like `O(n³·log(n))`, which is much smaller if `k` (the kernel size) has the same order of magnitude as `n`. You'll need to pad the kernel with zeros to match the image size, but then mathematically you should get the same as your "direct convolution" (there will be numerical differences, most likely). FFT works fastest if the image size is a power of 2, or a product of small primes (256 would be great, 200 is ok, 209 would be terrible) -- but that is also a question of the FFT library you'll use. – chtz Jun 27 '20 at 08:28

1 Answer


As a first step, replace your `mod((i - xShift), imageDimX)` with something like this:

inline int clamp( int x, int size )
{
    if( x < 0 ) return x + size;
    if( x >= size ) return x - size;
    return x;
}

These branches are very predictable because they yield the same result for long runs of consecutive elements. Integer modulo, which may compile to an actual division, is relatively slow.

Now, the next step (ordered by cost/benefit) is parallelizing. If you have any modern C or C++ compiler, just enable OpenMP somewhere in the project settings. After that you need two changes.

  1. Decorate your outermost loop with something like this: `#pragma omp parallel for schedule(guided)`
  2. Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from your k, for each iteration.

Next option: rework your code so you write each output value only once. Compute the final value in your innermost 3 loops, reading from scattered locations in both image and kernel, and store the result once. With that result[outIndex] += in the inner loop, the CPU stalls waiting for data from memory; when you accumulate in a variable held in a register rather than memory, there's no access latency.

SIMD is the most complicated optimization here. In short, you'll want the maximum FMA width your hardware has (with AVX and double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, so you saturate FMA throughput instead of being bound by its latency. Here's my answer to a much easier problem as an example of what I mean.

Soonts