
I'm trying to implement a convolutional neural network in Python. Originally, I was using scipy.signal's convolve2d function to do the convolution, but it has a lot of overhead, and it would be faster to just implement my own algorithm in C and call it from Python, since I know what my input looks like.

I've implemented 2 functions:

  1. Convolving a matrix with a non-separable kernel
  2. Convolving a matrix with a separable kernel (for now, I've assumed Python does the rank checking and splitting before passing the kernels on to C)

Neither of these functions has padding, since I require dimensionality reduction ("valid" convolution, so RESULT_DIM = IMG_DIM - FILTER_DIM + 1).

Non-separable 2D Convolution

// a - 2D matrix (as a 1D array), w - kernel
double* conv2(double* a, double* w, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int l1, l2;
    register int t1, t2;

    for(i = 0; i < RESULT_DIM; i++) 
    {
        t1 = i * RESULT_DIM; // loop invariants
        for(j = 0; j < RESULT_DIM; j++) 
        {   
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                t2 = k1 * FILTER_DIM;  // loop invariants
                for(l1 = FILTER_DIM - 1, l2 = 0; l1 >= 0; l1--, l2++)
                {
                    acc += w[t2 + l1] * a[(i + k2) * IMG_DIM + (j + l2)];
                }
            }
            result[t1 + j] = acc;
        }
    }

    return result;
}

Separable 2D Convolution

// a - 2D matrix, w1, w2 - the separated 1D kernels
double* conv2sep(double* a, double* w1, double* w2, double* result)
{
    register double acc;
    register int i; 
    register int j;
    register int k1, k2;
    register int t, t1, t2;
    double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));

    for(i = 0; i < RESULT_DIM; i++) // vertical pass: convolve with w1
    {
        t = i * IMG_DIM; // both a and tmp have row stride IMG_DIM
        for(j = 0; j < IMG_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w1[k1] * a[k2 * IMG_DIM + t + j];
            }
            tmp[t + j] = acc;
        }
    }

    for(i = 0; i < RESULT_DIM; i++) // horizontal pass: convolve with w2
    {
        t1 = i * IMG_DIM;    // row stride of tmp
        t2 = i * RESULT_DIM; // row stride of result
        for(j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for(k1 = FILTER_DIM - 1, k2 = 0; k1 >= 0; k1--, k2++)
            {
                acc += w2[k1] * tmp[t1 + (j + k2)];
            }

            result[t2 + j] = acc;
        }
    }

    free(tmp);
    return result;
}

Compiling with gcc's -O3 flag and testing on a 2.7 GHz Intel i7 with a 4000x4000 matrix and a 5x5 kernel, I get the following times for the two functions respectively (average of 5 runs):

271.21900 ms
127.32000 ms

This is still a considerable improvement over scipy.signal's convolve2d, which takes around 2 seconds for the same operation, but I need more speed since I'll be calling this function thousands of times. Changing the data type to float isn't an option at the moment, even though it would cause a considerable speedup.

Is there a way I can optimise these algorithms further? Can I apply any cache tricks or routines to speed it up?

Any suggestions would be appreciated.

cs95
  • "implement my own algorithm in C" - yeah, that could be pretty fast if you do it right. "port it to python" - wait, what? *Port* it? I hope you're just using the wrong term here, because a Python port would be slow as hell. Optimizing Python is very different from optimizing C. – user2357112 Jun 29 '16 at 16:49
  • Minor point: Rather than `double* tmp = (double*)malloc(IMG_DIM * RESULT_DIM * sizeof(double));`, recommend `double* tmp = malloc(sizeof *tmp * IMG_DIM * RESULT_DIM);` 1) cast not needed 2) `*tmp` easier to maintain than `double` 3) No `int` overflow issues. – chux - Reinstate Monica Jun 29 '16 at 16:49
  • Have you tried using gcc options `-march` and `-mtune`? – markgz Jun 29 '16 at 17:16
  • @user2357112 What I mean by "porting" it isn't a duplication of code in Python, I'll just be calling the C code through some sort of interface, like ctypes. What's the right terminology for that? – cs95 Jun 29 '16 at 17:21
  • Yeah, that's not porting. I'm not sure if there's a specific term for that; I'd just say "accessing it from Python". If there was some work on the library's side to provide a Python interface, I'd say "providing Python bindings". – user2357112 Jun 29 '16 at 17:27
  • @user2357112: Thanks for letting me know! I've edited that part. – cs95 Jun 29 '16 at 17:57
  • @markgz: What are those flags? I can't find anything on the internet about those. – cs95 Jun 29 '16 at 17:58
  • See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-march-2850 – markgz Jun 30 '16 at 17:43
  • Depending on your needs it might be worth testing [scipy.ndimage.filters.convolve](https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.ndimage.filters.convolve.html#scipy.ndimage.filters.convolve) or [scipy.signal.fftconvolve](https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.signal.fftconvolve.html). From some tests I've done, both of these can outperform scipy.signal.convolve2d, but it will depend on the size of the two arrays being convolved. – user3731622 Feb 25 '17 at 00:35

1 Answer


If you're running on x86 only then consider using SSE or AVX SIMD optimisation. For double data the throughput improvement will be modest, but if you can switch to float then you may be able to get to around 4x improvement with SSE or 8x with AVX. There are a number of questions and answers about this very topic on StackOverflow already from which you may be able to get some ideas on the implementation. Alternatively there are also many libraries available which include high performance 2D convolution (filtering) routines, and these typically exploit SIMD for performance, e.g. Intel's IPP (commercial), or OpenCV (free).
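
To give a flavour of the SIMD approach, here is a rough, untested sketch of an AVX version of the non-separable routine, computing four adjacent double-precision output pixels per iteration. It assumes compilation with -mavx and that RESULT_DIM is divisible by 4 (a scalar tail loop would handle any remainder); conv2_avx is just an illustrative name:

#include <immintrin.h>

double* conv2_avx(const double* a, const double* w, double* result)
{
    int i, j, k2, l2;

    for (i = 0; i < RESULT_DIM; i++)
    {
        for (j = 0; j < RESULT_DIM; j += 4)
        {
            __m256d vacc = _mm256_setzero_pd();
            for (k2 = 0; k2 < FILTER_DIM; k2++)
            {
                for (l2 = 0; l2 < FILTER_DIM; l2++)
                {
                    // same "flipped" kernel indexing as the scalar version
                    __m256d vw = _mm256_set1_pd(
                        w[(FILTER_DIM - 1 - k2) * FILTER_DIM + (FILTER_DIM - 1 - l2)]);
                    // four neighbouring input pixels, one per output lane
                    __m256d va = _mm256_loadu_pd(&a[(i + k2) * IMG_DIM + j + l2]);
                    vacc = _mm256_add_pd(vacc, _mm256_mul_pd(vw, va));
                }
            }
            _mm256_storeu_pd(&result[i * RESULT_DIM + j], vacc);
        }
    }

    return result;
}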

Another possibility is to exploit multiple cores - split your image into blocks and run each block in its own thread. E.g. if you have a 4 core CPU then split your image into 4 blocks. (See pthreads).
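
As a sketch of the threading idea (illustrative only - conv_job, conv2_worker, conv2_threaded and the fixed NUM_THREADS are placeholder names, and error handling is omitted): each thread computes an independent band of output rows, so no locking is needed beyond the final join.

#include <pthread.h>

typedef struct {
    const double *a, *w;
    double *result;
    int row_start, row_end; // half-open range of output rows
} conv_job;

static void* conv2_worker(void* arg)
{
    conv_job* job = arg;
    int i, j, k, l;
    double acc;

    for (i = job->row_start; i < job->row_end; i++)
        for (j = 0; j < RESULT_DIM; j++)
        {
            acc = 0.0;
            for (k = 0; k < FILTER_DIM; k++)
                for (l = 0; l < FILTER_DIM; l++)
                    acc += job->w[(FILTER_DIM - 1 - k) * FILTER_DIM + (FILTER_DIM - 1 - l)]
                         * job->a[(i + k) * IMG_DIM + (j + l)];
            job->result[i * RESULT_DIM + j] = acc;
        }

    return NULL;
}

void conv2_threaded(const double* a, const double* w, double* result)
{
    enum { NUM_THREADS = 4 };
    pthread_t threads[NUM_THREADS];
    conv_job jobs[NUM_THREADS];
    int t, rows = RESULT_DIM / NUM_THREADS;

    for (t = 0; t < NUM_THREADS; t++)
    {
        // last thread picks up any remainder rows
        jobs[t] = (conv_job){ a, w, result, t * rows,
                              t == NUM_THREADS - 1 ? RESULT_DIM : (t + 1) * rows };
        pthread_create(&threads[t], NULL, conv2_worker, &jobs[t]);
    }
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}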

You can of course combine both of the above ideas, if you really want to fully optimise this operation.


Some small optimisations which you can apply to your current code, and to any future implementations (e.g. SIMD):

  • if your kernels are symmetric (or odd-symmetric) then you can reduce the number of operations by adding (subtracting) symmetric input values and performing one multiply rather than two - see the first sketch after this list

  • for the separable case, rather than allocating a full frame temporary buffer, consider using a "strip-mining" approach - allocate a smaller buffer, which is full width, but a relatively small number of rows, then process your image in "strips", alternately applying the horizontal kernel and the vertical kernel. The advantage of this is that you have a much more cache-friendly access pattern and a smaller memory footprint - see the second sketch after this list
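
For the symmetric-kernel case, here is a minimal sketch of one output sample of a horizontal pass (it assumes FILTER_DIM is odd and w[k] == w[FILTER_DIM - 1 - k]; conv1_sym is just an illustrative name):

static double conv1_sym(const double* row, const double* w, int j)
{
    int k, half = FILTER_DIM / 2;
    double acc = w[half] * row[j + half]; // centre tap
    // add mirrored samples first: (FILTER_DIM + 1) / 2 multiplies
    // instead of FILTER_DIM
    for (k = 0; k < half; k++)
        acc += w[k] * (row[j + k] + row[j + FILTER_DIM - 1 - k]);
    return acc;
}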
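
For strip-mining, a rough, untested sketch of how the separable version might be restructured (conv2sep_strips is an illustrative name; STRIP_ROWS is a tuning parameter you would choose so that the buffer sits comfortably in cache):

// intermediate buffer holds only STRIP_ROWS rows instead of a full frame
double* conv2sep_strips(const double* a, const double* w1, const double* w2,
                        double* result)
{
    enum { STRIP_ROWS = 16 }; // tuning parameter
    double* tmp = malloc(sizeof *tmp * STRIP_ROWS * IMG_DIM);
    double acc;
    int row0, rows, i, j, k;

    for (row0 = 0; row0 < RESULT_DIM; row0 += STRIP_ROWS)
    {
        rows = RESULT_DIM - row0 < STRIP_ROWS ? RESULT_DIM - row0 : STRIP_ROWS;

        for (i = 0; i < rows; i++)      // vertical pass into the small buffer
            for (j = 0; j < IMG_DIM; j++)
            {
                acc = 0.0;
                for (k = 0; k < FILTER_DIM; k++)
                    acc += w1[FILTER_DIM - 1 - k] * a[(row0 + i + k) * IMG_DIM + j];
                tmp[i * IMG_DIM + j] = acc;
            }

        for (i = 0; i < rows; i++)      // horizontal pass out of the buffer
            for (j = 0; j < RESULT_DIM; j++)
            {
                acc = 0.0;
                for (k = 0; k < FILTER_DIM; k++)
                    acc += w2[FILTER_DIM - 1 - k] * tmp[i * IMG_DIM + j + k];
                result[(row0 + i) * RESULT_DIM + j] = acc;
            }
    }

    free(tmp);
    return result;
}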


A few comments on coding style:

  • the register keyword has been redundant for many years, and modern compilers will emit a warning if you try to use it - save yourself some noise (and some typing) by ditching it

  • casting the result of malloc in C is frowned upon - it's redundant and potentially dangerous.

  • make any input parameters const (i.e. read-only) and use restrict for any parameters which can never be aliased (e.g. a and result) - this can not only help to avoid programming errors (at least in the case of const), but in some cases it can help the compiler to generate better optimised code (particularly in the case of potentially aliased pointers) - see the example prototype below
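
For example, the non-separable prototype might then look like this (restrict requires C99 or later):

double* conv2(const double* restrict a, const double* restrict w,
              double* restrict result);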

Paul R
  • Thanks! I'll have to read up on SSE, because I haven't worked with it before. Would you also happen to know of any BLAS/LAPACK routines that can be applied here? – cs95 Jun 29 '16 at 16:31
  • Not BLAS/LAPACK, but you might want to look at other libraries, e.g. Intel's IPP (commercial), or OpenCV (free). – Paul R Jun 29 '16 at 16:32
  • So will I have to explicitly write code with SSE? I looked around and it seems gcc automatically generates SSE code with the -O3 flag set. – cs95 Jun 29 '16 at 19:31
  • gcc and other compilers can *sometimes* auto-vectorise - it's even possible that your code is already being vectorised (check the generated code). Humans can generally do better though, especially if the compiler fails to vectorise at all. – Paul R Jun 29 '16 at 19:34