I have a really complex HLSL shader doing tons of texture reads, using shader model 3 in Direct3D9. The complex code is only used at some pixels so I put an if-statement around that block of code. To my surprise this gives no performance gain at all. If I use clip(-1) instead I do see an enormous performance boost, so this shader is indeed the bottleneck of my program. Why doesn't the branching improve my performance without the clip(-1) line?

I found this topic: "How much performance do conditionals and unused samplers/textures add to SM2/3 pixel shaders?" It states that in shader model 3 it is possible to optimise with branching, but that the performance is that of the slowest pixel in each batch. In my case the slow branch is taken mostly at the edges of the screen and the fast branch mostly at the centre of the screen. I think this means that the pixels in a batch will generally take the same branch, so I would expect a performance gain this way.

In pseudo-code the pixel shader looks like this:

float4 colour = tex2D(texture, uv);
if (colour.a < 0.5f)
{
    //I only get a performance boost if I replace this line with clip(-1);
    oColour = colour;
}
else
{
    complexSlowCodeWithTonsOfTextureReadsGoesHere;
    oColour = result;
}
oColour *= 2;

This gives me the exact same performance as when I remove the branching and always use the code in the slow else-branch. If I replace the fifth line with clip(-1) I see an enormous performance boost (and a mostly black screen) so the if-statement is actually functioning.

Am I doing something wrong here or is it not possible to optimise a shader like this in shader model 3?

Oogst
  • Is your shader simply running once in a fullscreen pass, or is it used to draw several objects? In the first case, using the stencil buffer can give you a pretty good performance boost. – mrvux Oct 20 '14 at 14:34
  • It is a fullscreen posteffect. Using a stencil buffer for this would be pretty complex but possible. Gnietschow's reply below however already answers my question really well. :) (I only just discovered I could click the V sign to indicate my question has been answered...) – Oogst Oct 20 '14 at 18:12

2 Answers

The problem is that your if is being flattened (both branches are executed and the result of the untaken one is discarded), because you're using gradient functions like tex2D in one of your branches (doc). You should see the performance gain if you remove those functions from your branches or replace them with functions that don't rely on implicit gradients, like tex2Dlod or tex2Dgrad. The compiler can help you find the problematic lines if you add [branch] before your if. This tells the compiler that you want a real dynamic branch, and compilation will then fail if the branch contains gradient functions.

As far as my experience goes, the GPU shades fragments in 2x2 blocks. This is needed to compute the right mip level for a texture lookup, which requires information from the neighbouring fragments. This prevents the tex2D calls from being branched away, because the adjacent fragments need them. If you give the GPU the needed information by passing the mip level explicitly, the other fragments aren't needed anymore, so the branch can really be skipped.
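
To illustrate the idea, here is a minimal sketch of the question's shader with a forced dynamic branch and explicit-mip reads. It is not the asker's actual code: the sampler names (sceneSampler, detailSampler), the four-tap loop and the offsets are hypothetical stand-ins for the "complex slow code", and it assumes a ps_3_0 target and that sampling mip level 0 is acceptable.

```hlsl
// Sketch only: sceneSampler, detailSampler and the 4-tap loop are
// hypothetical stand-ins for the question's texture and slow code.
sampler2D sceneSampler  : register(s0);
sampler2D detailSampler : register(s1);

float4 main(float2 uv : TEXCOORD0) : COLOR0
{
    // This read sits outside the branch, so implicit-gradient tex2D is fine.
    float4 colour = tex2D(sceneSampler, uv);
    float4 oColour;

    // Ask the compiler for a real dynamic branch instead of flattening.
    [branch]
    if (colour.a < 0.5f)
    {
        oColour = colour;
    }
    else
    {
        float4 result = 0;

        [unroll]
        for (int i = 0; i < 4; ++i)
        {
            float2 offset = float2(i * 0.001f, 0.0f);
            // tex2Dlod takes the mip level in .w (0 here, an assumption),
            // so no gradients from neighbouring fragments are required.
            result += tex2Dlod(detailSampler, float4(uv + offset, 0.0f, 0.0f));
        }

        oColour = result * 0.25f;
    }

    return oColour * 2;
}
```

If you still need mipmapping inside the branch, one option is to compute ddx(uv) and ddy(uv) before the if and pass them to tex2Dgrad inside it.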

Gnietschow
  • Thanks for the very clear explanation! I switched to using tex2Dlod and I indeed see the expected performance gain now! :) By the way, I clearly see in my framerate that clip(-1) works with tex2D, which surprises me since that would also interfere with getting the miplevel information from the neighbours. – Oogst Oct 19 '14 at 19:38
  • This point already confused me too; unfortunately I have no explanation for it. The only thing I can imagine is that such clipped fragments result in a special value. I hope others can enlighten us :) – Gnietschow Oct 19 '14 at 20:57

Use the Z/Stencil buffer to mask off areas that you don't want the shader to run on.

Dwedit