Optimized image convolution algorithm

Question

I am working on implementing Image convolution in C++, and I already have a naive working code based on the given pseudo code:

for each image row in input image:
   for each pixel in image row:

      set accumulator to zero

      for each kernel row in kernel:
         for each element in kernel row:

            if element position  corresponding* to pixel position then
               multiply element value  corresponding* to pixel value
               add result to accumulator
            endif

      set output image pixel to accumulator

As this can be a big bottleneck with big Images and Kernels, I was wondering if there exist some other approach to make things faster ? even with additionnal input info like : sparse image or kernel, already known kernel etc...

I know this can be parallelized, but it's not doable in my case.

Some kernels are linearly separable (Google this). This provides a huge speedup (order of magnitude). — MFisherKDX, Mar 02 '18 at 03:42
Optimization without knowing platform on which you want to run is not possible how do we know what it is capable of or what is faster or slower there (when you did not specify it at all)? when parallelization is not possible then there are other means like SIMD , auto-generated hard-coded convolution, reuse of sub results and possibly a lot of other techniques .... — Spektre, Mar 02 '18 at 08:20
@Spektre it's because I am talking about the algorithm, that is to say a better way/approach to code this. — Pierre Baret, Mar 02 '18 at 17:06
Take a look at [this](http://ieeexplore.ieee.org/document/8324097/) and [this](http://ieeexplore.ieee.org/document/8310675/) papers — Amiri, Mar 31 '18 at 11:41

score 11 · Accepted Answer · answered Mar 02 '18 at 06:18

if element position  corresponding* to pixel position then

I presume this test is meant to avoid a multiplication by 0. Skip the test! multiplying by 0 is way faster than the delays caused by a conditional jump.

The other alternative (and it's always better to post actual code rather than pseudo-code, here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive also. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:

for (row = 0; row < k/2; ++row) {
   // inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (row = k/2; row < nrows-k/2; ++row) {
   // inner loop over kernel rows is unrestricted
}
for (row = nrows-k/2; row < nrows; ++row) {
   // inner loop over kernel rows is adjusted
}

Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.

To avoid the code repetition you can create a larger image, copy the image data over, padded with zeros on all sides. The loops now do not need to worry about accessing out-of-bounds, you have much simpler code.

Next, a certain class of kernel can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,1,1] and [1,0,-1]^T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for a NxN kernel, you go from N² to 2N operations.

In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.

Besides the obvious computational cost saving, the code is also much simpler for these 1D convolutions. The 9 repeated blocks of code we had earlier become 3 for a 1D filter. The same code for the horizontal filter can be re-used for the vertical one.

Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result. 3 transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and whether it is separable or not.

Excellent and through answer. I am very skeptical about the FFT solution for kernels of small and moderate size. — , Mar 02 '18 at 08:12
@YvesDaoust: I agree. You need to time it to see if it's faster. And then you need to decide what boundary conditions you need, the periodic assumption of the DFT is not always useful. — Cris Luengo, Mar 02 '18 at 13:54
Thanks for this very interesting and complete answer, exactly the kind of tip I was looking for ! — Pierre Baret, Mar 02 '18 at 17:10

MBo · Answer 2 · 2018-03-02T09:17:07.807

3

For small kernel size simple method might be faster. Also note that separable kernels (for example, Gauss kernel is separable) as mentioned, allow to make filtering by lines then by columns, resulting O(N^2 * M) complexity.

For other cases: there exists fast convolution based on FFT (Fast Fourier Transform). It's complexity is O(N^2*logN) (where N is size of image ) comparing to O(N^2*M^2) for naive implementation.

Of course, there some peculiarities in applying this techniques, for example, edge effects, but one needs to account for them in naive implementation too (in a lesser degree though).

 FI = FFT(Image)
 FK = FFT(Kernel)
 Prod = FI * FK (element-by-element complex multiplication)
 Conv(I, K) = InverseFFT(Prod)

Note that you can use some fast library intended for image filtering, for example, OpenCV allows to apply kernel to 1024x1024 image in 5-30 milliseconds.

edited Mar 02 '18 at 09:17

answered Mar 02 '18 at 03:41

MBo

77,366
5
53
86

1

For small kernels, M² beats Log N, not counting that the hidden constant is much better for the naive implementation. – Mar 02 '18 at 08:11
@Yves Daoust OK, added this info to answer (while I think that author understand that there is no silver bullet for all cases) – MBo Mar 02 '18 at 09:22
Thanks for this answer. – Pierre Baret Mar 02 '18 at 17:08
that pdf link is unreadable (possibly some weird font) how do you handle different resolution of N and M (how to piece wise multiply them when they do not have the same resolution)? Is a non power of 2 resolution a problem (and how to handle it if zero padding changes results) or not? btw here [How to compute Discrete Fourier Transform?](https://stackoverflow.com/a/26355569/2521214) are some links with my C++ implementations for DFT/DFFT/DCT/DFCT – Spektre Mar 05 '18 at 08:14
@Spektre I don't see problems with pdf. This is widely known "Numerical Recipes in C" chapter.Perhaps your reader makes font change.Other questions - I feel they are rhetorical and you know all answers. – MBo Mar 05 '18 at 08:30
@MBo that is weird I just tried it again and suddenly all the texts are ok before that (25mins ago) all the texts where written over itself to point it was unreadable now I see pages of text before it was just few words (like all the words where printed on the same position). – Spektre Mar 05 '18 at 08:37
1

@Spektre Official site (perhaps pdf is the same): http://www.nrbook.com/a/bookcpdf.php All chapters in a single pdf: https://www2.units.it/ipl/students_area/imm2/files/Numerical_Recipes.pdf – MBo Mar 05 '18 at 08:46

score 1 · Answer 3 · answered Feb 20 '20 at 21:12

One way to this speed up, might be, depending on target platform, to distinctly get every value in the kernel, then, in memory, store multiple copies of the image, one for every distinct value in the kernel, and multiply each copy of the image by its distinct kernel value, then at the end, multiply by distinct kernel value, shift, sum and divide up all the image copies into one image. This could be done on a graphics processor for example where memory is ample and which is more suited for this tight repetitive processing. The copies of the image will need to support overflow of the pixels, or you could use floating point values.

Optimized image convolution algorithm

3 Answers3

Linked