16

Implementing convolution in a pixel shader is somewhat costly due to the very high number of texture fetches.

A direct way of implementing a convolution filter is to make N x N lookups per fragment using two nested for loops. A simple calculation shows that a 1024 x 1024 image blurred with a 4 x 4 Gaussian kernel would need 1024 x 1024 x 4 x 4 ≈ 16M lookups.
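Something along these lines (a sketch only: the uniform names, the fixed 4x4 kernel size and the old-style texture2D/gl_FragColor interface are just placeholders, not code I actually have):

    // Brute-force N x N convolution: one texture fetch per kernel element.
    // 'image' is the source texture, 'texelSize' is vec2(1.0/width, 1.0/height),
    // 'kernel' holds the N x N weights row by row. 'uv' is assumed to sit on a
    // texel center, and the kernel is anchored around the current texel.
    #define KERNEL_SIZE 4

    uniform sampler2D image;
    uniform vec2 texelSize;
    uniform float kernel[KERNEL_SIZE * KERNEL_SIZE];

    varying vec2 uv;

    void main() {
        vec4 sum = vec4(0.0);
        // Two nested loops: KERNEL_SIZE * KERNEL_SIZE lookups per fragment.
        for (int y = 0; y < KERNEL_SIZE; ++y) {
            for (int x = 0; x < KERNEL_SIZE; ++x) {
                vec2 offset = (vec2(float(x), float(y)) - float(KERNEL_SIZE / 2)) * texelSize;
                sum += texture2D(image, uv + offset) * kernel[y * KERNEL_SIZE + x];
            }
        }
        gl_FragColor = sum;
    }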

What can one do about this?

  1. Can one use some optimization that would need fewer lookups? I am not interested in kernel-specific optimizations like the ones for the Gaussian (or are they kernel-specific?)
  2. Can one at least make these lookups faster by somehow exploiting the locality of the pixels one would work with?

Thanks!

Albus Dumbledore

4 Answers

21

Gaussian kernels are separable, which means you can do a horizontal pass first, then a vertical pass (or the other way around). That turns the N x N fetches per pixel into 2 x N. This works for all separable filters, not just blurs (not all filters are separable, but many are, and some are "as good as").
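For illustration, the horizontal pass of such a separable blur might look roughly like this (a sketch only: the 9-tap weights, the uniform names and the old-style GLSL interface are assumptions); the vertical pass is the same shader with the offset pointing along y instead of x:

    // Horizontal pass of a separable Gaussian: N fetches instead of N*N.
    // Render this into an intermediate target, then run the same logic
    // vertically on that result.
    uniform sampler2D image;
    uniform float texelWidth;   // 1.0 / texture width (assumed uniform)

    varying vec2 uv;

    void main() {
        // Example weights for a 9-tap Gaussian (center + 4 taps per side).
        float w[5];
        w[0] = 0.227027; w[1] = 0.194595; w[2] = 0.121622;
        w[3] = 0.054054; w[4] = 0.016216;

        vec4 sum = texture2D(image, uv) * w[0];
        for (int i = 1; i < 5; ++i) {
            vec2 offset = vec2(float(i) * texelWidth, 0.0);
            sum += texture2D(image, uv + offset) * w[i];
            sum += texture2D(image, uv - offset) * w[i];
        }
        gl_FragColor = sum;
    }

For the 4x4 example from the question, this scheme costs roughly 4 + 4 fetches per pixel across the two passes instead of 16.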

Or, in the particular case of a blur filter (Gaussian or not), which is really just a weighted sum, you can take advantage of texture interpolation, which may be faster for small kernel sizes (but definitely not for large ones).

EDIT: image for the "linear interpolation" method

The "linear interpolation method"

EDIT (as requested by Jerry Coffin) to summarize the comments:

In the "texture filter" method, linear interpolation will produce a weighted sum of adjacent texels according to the inverse distance from the sample location to the texel center. This is done by the texturing hardware, for free. That way, 16 pixels can be summed in 4 fetches. Texture filtering can be exploited in addition to separating the kernel.

In the example image, on the top left, your sample (the circle) hits the center of a texel. What you get is the same as "nearest" filtering: you get that texel's value. On the top right, you are in the middle between two texels, so what you get is the 50/50 average between them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, but with the weight biased towards the top left one (darkest shade of blue).
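To make the trick concrete, here is a sketch of the horizontal pass using such "in-between" taps (the 9-tap weights and uniform names are the same assumptions as above; it requires the texture's filter mode to be GL_LINEAR):

    // Horizontal Gaussian pass using bilinear filtering to halve the fetches:
    // the 9 discrete taps collapse into 5 fetches (1 center + 2 merged per side).
    uniform sampler2D image;
    uniform float texelWidth;   // 1.0 / texture width (assumed uniform)

    varying vec2 uv;

    void main() {
        // Merged weight of discrete taps i and i+1:  W = w_i + w_{i+1}
        // Merged offset (in texels):                 O = (i*w_i + (i+1)*w_{i+1}) / W
        float weight[3];
        float offset[3];
        weight[0] = 0.227027;
        weight[1] = 0.194595 + 0.121622;   // taps 1 and 2 combined
        weight[2] = 0.054054 + 0.016216;   // taps 3 and 4 combined
        offset[0] = 0.0;
        offset[1] = (1.0 * 0.194595 + 2.0 * 0.121622) / weight[1];
        offset[2] = (3.0 * 0.054054 + 4.0 * 0.016216) / weight[2];

        vec4 sum = texture2D(image, uv) * weight[0];
        for (int i = 1; i < 3; ++i) {
            vec2 off = vec2(offset[i] * texelWidth, 0.0);
            sum += texture2D(image, uv + off) * weight[i];
            sum += texture2D(image, uv - off) * weight[i];
        }
        gl_FragColor = sum;
    }

Combined with the separation above, that brings a 9x9 Gaussian down to 2 x 5 = 10 fetches per pixel instead of 81.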

The following suggestions are courtesy of datenwolf (see below):

"Another methods I'd like suggest is operating in fourier space, where convolution turns into a simple product of fourier transformed signal and fourier transformed kernel. Although the fourier transform on the GPU itself is quite tedious to implement, at least using OpenGL shaders. But it's quite easy done in OpenCL. Actually I implement such things using OpenCL, now, a lot of image processing in my 3D engine happens in OpenCL.

OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the example code in Wikipedia's OpenCL article: en.wikipedia.org/wiki/OpenCL and yes, the performance gain is tremendous. An FFT executes in at most O(n log n), and the inverse transform is the same. The Fourier representation of the filter kernel can be precomputed. The pipeline is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Take note that the actual convolution is just O(n) there.

In the case of a separable, finite convolution like a Gaussian blur, the separation solution will outperform the Fourier method. But in the case of general, possibly non-separable kernels, the Fourier method is probably the fastest one available. OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex buffers) for both input and output of OpenCL programs."

Damon
  • To elaborate the "texture filter" method, you can get all 16 pixels in a 4x4 block summed/averaged with only 4 samples placed in between the texels. For box blur, this is entirely trivial, just place them half-way. For a Gaussian, it's a bit more complicated, the taps need to be placed a bit closer to the center so these pixels have more weight (though I don't have the exact distance at hand now). All in all, that would be 4 instead of 8 fetches for separating the kernel, and 16 for the non-optimized version. – Damon Mar 09 '11 at 10:09
  • Thanks! I've heard about both ideas, but not in the more general view you gave them. Could you explain a bit more about how and why texture interpolation helps and how it is related to the Gaussian-like filters? – Albus Dumbledore Mar 09 '11 at 10:12
  • Could I ask what you mean by *taps*? Thanks! – Albus Dumbledore Mar 09 '11 at 10:13
  • Linear interpolation helps insofar as instead of sampling the texture so that exactly one texel is hit, you can place your sample points ("taps") so that they are somewhere in between. The interpolation hardware will return a weighted average (weighted by how close the sample is to the adjacent texel centers) at no extra cost. I will try to come up with a drawing; that's maybe easier. – Damon Mar 09 '11 at 10:40
  • @Albus Dumbledore: The Gaussian profile is a continuous, differentiable distribution. For a discrete filtering process this profile must be turned into a finite vector of discrete elements. One may call them samples, but "samples" usually refers to the signal being filtered, so for the kernel they are called taps on the filter profile. – datenwolf Mar 09 '11 at 10:43
  • Another method I'd like to suggest is operating in Fourier space, where convolution turns into a simple product of the Fourier-transformed signal and the Fourier-transformed kernel. The Fourier transform itself is quite tedious to implement on the GPU, at least using OpenGL shaders, but it's quite easily done in OpenCL. Actually, I implement such things using OpenCL now; a lot of the image processing in my 3D engine happens in OpenCL. – datenwolf Mar 09 '11 at 10:46
  • Added an image that hopefully helps understanding. On the top left, your sample (the circle) hits the center of a texel. What you get is the same as "nearest" filtering: you get that texel's value. On the top right, you are in the middle between two texels, so what you get is the 50/50 average between them (pictured by the lighter shade of blue). On the bottom right, you sample in between 4 texels, but somewhat closer to the top left one. That gives you a weighted average of all 4, but with the weight biased towards the top left one (darkest shade of blue). – Damon Mar 09 '11 at 10:50
  • @datenwolf, could you show me some samples of this idea? Does it give any performance benefit? Do you think OpenCL works relatively well on the GPU? Thanks! – Albus Dumbledore Mar 09 '11 at 10:59
  • @Albus Dumbledore: OpenCL has been specifically designed for running on GPUs. A Fast Fourier Transform is actually the example code in Wikipedia's OpenCL article: http://en.wikipedia.org/wiki/OpenCL and yes, the performance gain is tremendous. An FFT executes in at most O(n log n), and the inverse transform is the same. The Fourier representation of the filter kernel can be precomputed. The pipeline is FFT -> multiply with kernel -> IFFT, which boils down to O(n + 2n log n) operations. Take note that the actual convolution is just O(n) there. – datenwolf Mar 09 '11 at 11:08
  • In the case of a separable, finite convolution like a Gaussian blur the separation solution will outperform the Fourier method. But in the case of general, possibly non-separable kernels the Fourier method is probably the fastest one available. – datenwolf Mar 09 '11 at 11:08
  • @Albus Dumbledore: OpenCL integrates nicely with OpenGL, e.g. you can use OpenGL buffers (textures and vertex buffers) for both input and output of OpenCL programs. – datenwolf Mar 09 '11 at 11:10
  • @dm.skt: +1 -- but if possible, I'd suggest editing the answer to include the information that's now in the comments. In this case, the comments contain about as much information as the answer (and maybe even more). – Jerry Coffin Mar 09 '11 at 15:10
4

More than being separable, Gaussian filters are also computable in O(1) per pixel, i.e. at a cost independent of the kernel size:

There are recursive computations like the Deriche one:

http://hal.inria.fr/docs/00/07/47/78/PDF/RR-1893.pdf

fa.
  • Recursive Gaussian sounds cool, though I am not sure what you mean by O(1) in this situation. Thanks! – Albus Dumbledore Mar 09 '11 at 11:03
  • O(1) in this case means the number of samples you take does not depend on the size of the kernel. It does not necessarily mean that it will be faster or that there will be any fewer samples overall; it just says that the number of samples does not depend on the kernel size. Importance sampling does a similar job, but it is only an approximation. Getting O(1) is not hard if "approximately" is good enough. You could for example settle for a fixed "budget" of 10 samples. Then you use some scattered sample pattern and multiply each value with the correct Gaussian weight according to its distance. – Damon Mar 09 '11 at 11:34
  • Summed area tables are another example of doing a blur (box, in this case, not Gaussian) in O(1). The downside is that you have to generate the SAT in the first place, so whether it is a "win" or "lose" overall depends on your situation (see the sketch after these comments). – Damon Mar 09 '11 at 11:44
  • Thanks dm.skt! Nicely explained. By the way, I forgot about SAT. :-D Good of you to remind me... – Albus Dumbledore Mar 09 '11 at 12:19
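To make the summed-area-table idea from the comments above concrete, here is a sketch of the lookup pass only (generating the SAT itself is a separate prepass; the texture and uniform names are assumptions, and half-texel/off-by-one details are glossed over):

    // Box blur of arbitrary radius in 4 fetches from a precomputed
    // summed-area table stored in a float texture 'sat'.
    // Sum over a rectangle = D - B - C + A, where A..D are the SAT values
    // at the rectangle's four corners.
    uniform sampler2D sat;
    uniform vec2 texelSize;   // 1.0 / texture dimensions (assumed uniform)
    uniform float radius;     // box half-width in texels (assumed uniform)

    varying vec2 uv;

    void main() {
        vec2 r = radius * texelSize;
        vec4 A = texture2D(sat, uv + vec2(-r.x, -r.y));
        vec4 B = texture2D(sat, uv + vec2( r.x, -r.y));
        vec4 C = texture2D(sat, uv + vec2(-r.x,  r.y));
        vec4 D = texture2D(sat, uv + vec2( r.x,  r.y));
        float side = 2.0 * radius;   // box side length in texels (approximate)
        gl_FragColor = (D - B - C + A) / (side * side);
    }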
3

Rotoglup's answer to my question here may be worth reading; in particular, this blog post about Gaussian blur really helped me understand the concept of separable filters.

Nicolas Lefebvre
0

One more approach is approximating the Gaussian curve with a step-wise (piecewise-constant) function: https://arxiv.org/pdf/1107.4958.pdf (I guess piecewise-linear functions could of course be used as well).

Stepan Yakovenko