
I want to average/blur the pixels of a texture over an n×n square. For this purpose I have one shader for horizontal and one for vertical averaging. My question is how profitable a single pass becomes when n is small.

For example, when n is 4, I can do everything in one pass with 4 samples (by sampling at the shared corners of pixels with GL_LINEAR, getting a free, equally weighted 2×2 average per sample) instead of in two passes with 2 samples each.
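To make the GL_LINEAR corner trick concrete, here is a small sketch (plain Python, not OpenGL) that simulates a bilinear fetch: sampling exactly at the shared corner of four texels returns their equal-weight average, so four such taps reproduce the full 4×4 box average. The texture values and the `bilinear` helper are illustrative, not part of any real API.

```python
import math

# Sketch (not OpenGL): simulate a GL_LINEAR fetch to show that one bilinear
# tap placed on a texel corner averages 4 texels with equal weights, so
# 4 taps cover a 4x4 block.

def bilinear(tex, x, y):
    """Bilinear sample at continuous coords; texel (i, j) is centered at (i+0.5, j+0.5)."""
    fx, fy = x - 0.5, y - 0.5            # shift so texel centers land on integers
    x0, y0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - x0, fy - y0            # interpolation weights in [0, 1)
    def t(i, j):                         # clamp-to-edge texel fetch
        return tex[min(max(j, 0), len(tex) - 1)][min(max(i, 0), len(tex[0]) - 1)]
    return ((1 - tx) * (1 - ty) * t(x0, y0) + tx * (1 - ty) * t(x0 + 1, y0)
            + (1 - tx) * ty * t(x0, y0 + 1) + tx * ty * t(x0 + 1, y0 + 1))

# A 4x4 texture of arbitrary values.
tex = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]

# Four taps at the interior shared corners of the 4x4 block.
taps = [bilinear(tex, x, y) for x in (1.0, 3.0) for y in (1.0, 3.0)]
one_pass = sum(taps) / 4.0

box = sum(sum(row) for row in tex) / 16.0
print(one_pass, box)  # both 8.5: the 4-tap result equals the 16-texel box average
```

The same placement works per pass in the two-pass version, where a tap midway between two texels averages them along the blur axis.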

If n was 10, it would be obvious to go for two passes with 5 samples each, totalling 10, rather than one pass with 25 samples.

But what about, say, n = 6? That would be 2×3 = 6 samples in two passes, or 9 samples in one pass. Is there a rule of thumb for how the additional sampling balances out against the flat cost of an additional render to texture?
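The sample counts above follow a simple pattern, sketched below for even n (the odd case needs an extra weighted tap and is messier). With GL_LINEAR, each tap covers 2 texels along the blur axis in a 1D pass, or a 2×2 footprint in the single 2D pass; these helper names are my own, not from any library.

```python
import math

# Total texture fetches for an n x n box blur, assuming GL_LINEAR merges
# 2 texels per tap in a 1D pass and a 2x2 footprint per tap in a 2D pass.

def taps_two_pass(n):
    return 2 * math.ceil(n / 2)      # horizontal pass + vertical pass

def taps_one_pass(n):
    return math.ceil(n / 2) ** 2     # one 2D pass of bilinear 2x2 taps

for n in (4, 6, 10):
    print(n, taps_two_pass(n), taps_one_pass(n))
# n=4: 4 vs 4, n=6: 6 vs 9, n=10: 10 vs 25 -- matching the counts above
```

So the single pass only breaks even at n = 4; beyond that the gap grows quadratically, and the question is whether the small gap at n = 6 outweighs one render-to-texture switch.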

Zyl
    There is probably no one answer to that question. It is likely that it depends on hardware, configuration and other parameters. You could try to benchmark the problem on several machines and see if you can find a reasonable threshold. Otherwise I would always go with the separated two-pass version. The difference is probably not worth implementing two variations. – Nico Schertler Feb 05 '14 at 18:52
    By the way, you can fetch multiple samples at once without using `GL_LINEAR` on Shader Model 4.1 hardware. It exposes "`gather4`", which basically allows you to return component-wise, the samples that the hardware would have used for linear filtering. If you structure your algorithm properly, this can be a tremendous boon. Say you only need luminance, you can fetch a single channel of each of the 4 texels and return a single `vec4`, even better the 4 samples are not interpolated in any way at this point so you can do even more with them than you could with a weighted average from `GL_LINEAR`. – Andon M. Coleman Feb 05 '14 at 19:13

1 Answer


As with everything: Profile it for your target hardware.

Other than that, generally speaking, it is more advantageous to split the blur into two passes for one main reason: cache coherency.

Splitting the box blur into two passes (two 1D kernels) turns each pass's fetches into linear memory accesses, which is typically faster even though it has to run twice.

Note: this might not be true for your hardware, so profile it (and share the results)!

Relevant question on GD.SE.

akaltar
  • But there are a lot of shader instances requesting samples from many different locations in the first place. How does effective caching even work at all in such an environment? I'm struggling to find a good read on this. – Zyl Feb 17 '14 at 19:48
  • I cannot back that up with a reference right now; I can't seem to find the article. I also forgot to add that it's better because of the lower complexity as well (as you mentioned in the question, and as mentioned in the linked question). – akaltar Feb 19 '14 at 11:03
  • The following source mentions "data prefetchers", which try to recognize and "guess" your access patterns: http://www.sisoftware.net/?d=qa&f=gpu_mem_latency Sounds crazy in my opinion, but it might just work if manufactured with utmost wisdom. So yeah, profiling appears to be the way to go. Different hardware varies wildly in how it works. – Zyl Feb 20 '14 at 21:49
  • @Zyl If you think such predictions are crazy, just look at how much branch prediction helps. [link](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) – akaltar Feb 21 '14 at 14:49