20

I'm interested in information about the speed of sin() and cos() in the OpenGL Shading Language (GLSL).

The GLSL Specification Document indicates that:

The built-in functions basically fall into three categories:

  • ...
  • ...
  • They represent an operation graphics hardware is likely to accelerate at some point. The trigonometry functions fall into this category.

EDIT:

As has been pointed out, counting clock cycles of individual operations like sin() and cos() doesn't really tell the whole performance story.

So to clarify my question, what I'm really interested in is whether it's worthwhile to optimize away sin() and cos() calls for common cases.

For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:

float sina, cosa;

if ( rotation == 0.0 )
{
   sina = 0.0;
   cosa = 1.0;
}
else
{
   sina = sin( rotation );
   cosa = cos( rotation );
}

Or will the GLSL compiler or the sin() and cos() implementations take care of optimizations like that for me?

ulmangt
  • What do you mean by "modern GPUs provide hardware acceleration for `sin()` and `cos()`"? If it's running on the GPU, it can be said to be hardware accelerated. In any event, your best bet is to try it out and profile it, as clock cycles on a GPU are somewhat meaningless without more context as to what you're doing. Even between different cards from the same vendor, there can be differences in the number of execution units, so cycles only tell you part of the story. – user1118321 Apr 14 '12 at 16:03
  • With those GPUs, I think you'll have the fastest possible execution time of those trigonometric functions. Interesting question... – Radu Murzea Apr 14 '12 at 16:06
  • As pointed out in [this](http://stackoverflow.com/questions/10111898/glsl-relative-to-each-other-how-expensive-are-operations-like-multiply-divide) and [this](http://stackoverflow.com/questions/8415251/performance-of-different-cg-glsl-hlsl-functions) question, this question is essentially unanswerable. A particular use of `sin` might cost *nothing*, depending on where you use it and the hardware. – Nicol Bolas Apr 14 '12 at 16:44
  • @user1118321 Good points. I've modified my question to try to make it a little more explicit. – ulmangt Apr 14 '12 at 19:17
  • @NicolBolas Thanks for the links. http://stackoverflow.com/questions/8415251/performance-of-different-cg-glsl-hlsl-functions is particularly informative regarding why simply counting GPU execution unit clock cycles doesn't tell the whole performance story. I've edited my question to try to more explicitly address whether the particular optimization that I'm thinking about making is worthwhile. – ulmangt Apr 14 '12 at 19:19
  • For the above, you might find the shader executes both branches and only then decides which result to make use of. The kind of optimisation you're making here is, in my opinion, not worth the trouble and may even result in a reduction in performance, not an increase. – Robinson Apr 14 '12 at 19:32
  • Hmm, don't know if it is reasonable to assume some kind of optimization for specific `uniform` vars. Doesn't make sense for `in/attribute` vars, though. – Stefan Hanke Apr 16 '12 at 05:07
  • I'm voting to close this question as off-topic because this question is basically asking "how fast is this operation in this language", which is unanswerable, because it depends on compiler, platform, and a bunch of other things, none of which were specified. – rubenvb Nov 08 '16 at 15:15
  • @Robinson that hasn't been good advice for a long time now. If the branch is on a uniform, or even dynamic but a lot of threads in the wavefront take the same path, it can be faster. Whether it's worth it in the case of sin/cos is up to measurement, though. – Steve May 04 '21 at 20:58
  • I think at 9 years old, questions and replies about performance are somewhat out of date, yes. – Robinson May 06 '21 at 11:32

6 Answers

22

For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:

No.

Your compiler will do one of two things.

  1. It will issue an actual conditional branch. In the best possible case, if 0 is a value that is coherent locally (such that groups of shader invocations will often hit 0 or non-zero together), then you might get improved performance.
  2. It will evaluate both sides of the condition, and only store the result for the correct one of them. In which case, you've gained nothing.
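As a rough sketch of the second case, this is what "evaluate both sides and keep one" looks like when written out by hand in GLSL. The helper name `rotationSinCos` is made up for illustration, and a real compiler may or may not emit something equivalent:

```glsl
// Hypothetical helper; illustrates the "compute both, select one" form.
vec2 rotationSinCos(float rotation)
{
    // Both candidate results are always computed...
    vec2 general  = vec2(sin(rotation), cos(rotation));
    vec2 zeroCase = vec2(0.0, 1.0);

    // ...and the comparison only picks which one is kept.
    // float(bool) yields 1.0 for true and 0.0 for false.
    float isZero = float(rotation == 0.0);
    return mix(general, zeroCase, isZero);
}
```

In this form sin() and cos() are paid for on every invocation, so the special case saves nothing.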

In general, it's not a good idea to use conditional logic to dance around small performance costs like this. The saving needs to be really big to be worthwhile, like skipping work via a discard or something.

Also, do note that an exact floating-point equality test is not likely to work, unless you actually pass a uniform or vertex attribute containing exactly 0.0 to the shader. Even interpolating between 0 and non-zero will likely never produce exactly 0 for any fragment.
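If an exact comparison is not reliable for your inputs, one common workaround is a tolerance test. A minimal sketch, assuming a hypothetical `EPSILON` constant whose value you would have to choose for your own data:

```glsl
// EPSILON is an assumed tolerance; choose it for your data range.
const float EPSILON = 1.0e-6;

// Hypothetical helper name.
bool isEffectivelyZero(float rotation)
{
    // True when rotation lies within EPSILON of zero.
    return abs(rotation) < EPSILON;
}
```

This only changes how the condition is tested; it does not change the cost trade-off of branching described above.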

Nicol Bolas
  • I would actually be passing the 0.0 value to the shader as a vertex attribute. But good point: if I weren't, testing whether the value is within some small epsilon of 0 would probably be necessary. Point taken, though, about it probably not being worthwhile in the first place. – ulmangt Apr 14 '12 at 21:01
  • Depending on the amount of work each shader has to do, you might win by having two variants of it, one for where you know it's zero and one where it isn't. But switching shaders isn't cheap, so it depends on the workload. – Robinson Apr 14 '12 at 21:03
  • @NicolBolas And actually, after reading your answer and remembering some of my CUDA, I think there's a third option: the shader may evaluate the first side of the condition for the threads where `rotation==0` while the others block (or noop), then evaluate the second side while the first block. Which would obviously be bad as well. Although that's assuming shaders evaluate similarly to CUDA kernels. – ulmangt Apr 14 '12 at 21:05
  • Sometimes `discard` is really expensive too. If you don't mind writing Z, or aren't writing Z anyway, a zero alpha write can be much faster. (I've gotten 100+ percent speed-ups replacing discards with 0 alpha draws.) GPUs love it when all of the threads are doing the same thing. – doug65536 Apr 08 '16 at 05:46
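The two-variant idea from the comments could be sketched with the preprocessor, as below. This assumes the application compiles the same source twice and inserts a `#define ROTATION_IS_ZERO` line after the `#version` directive for one of the builds; `ROTATION_IS_ZERO` and `rotate2D` are names invented for this example:

```glsl
// Assumption: the host compiles this source twice, once as-is and once
// with "#define ROTATION_IS_ZERO" inserted after the #version line.
vec2 rotate2D(vec2 p, float rotation)
{
#ifdef ROTATION_IS_ZERO
    // Variant for draws known to be unrotated: no trig at all.
    return p;
#else
    float s = sin(rotation);
    float c = cos(rotation);
    // Standard 2D rotation of p by the given angle.
    return vec2(c * p.x - s * p.y, s * p.x + c * p.y);
#endif
}
```

Whether this wins depends on how often the application ends up switching programs between draws, so it needs measuring.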
9

This is a good question. I too wondered this.

Links I found via Google say that cos and sin have been single-cycle on mainstream cards since 2005 or so.

Will
5

You'd have to test this out yourself, but I'm pretty sure that branching in a shader is far more expensive than a sin or cos calculation. GLSL compilers are pretty good at optimizing shaders, so worrying about this is premature optimization. If you later find that, across your entire program, your shaders are the bottleneck, then you can worry about optimizing this.

If you want to take a look at the assembly code of your shader for a specific platform, I would recommend AMD GPU ShaderAnalyzer.

fospathi
Robert Rouhani
  • "at **an** assembly code". There is no "**the** assembly" for shaders. It changes from platform to platform. And even from driver revision to driver revision. – Nicol Bolas Apr 14 '12 at 19:52
  • A branch on a bool uniform is likely to be free of cost. I've used that technique in this type of situation when it was appropriate. – Michael Daum Apr 14 '12 at 20:51
  • @RobertRouhani Thanks for the AMD GPU ShaderAnalyzer link. – ulmangt Apr 14 '12 at 21:07
  • broken link, here's an update to the URL: http://developer.amd.com/tools-and-sdks/graphics-development/gpu-shaderanalyzer/ – miketucker Mar 26 '15 at 19:32
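For reference, the branch-on-a-uniform approach mentioned in the comments might look like the vertex shader below. This is only a sketch: `u_hasRotation`, `u_rotation` and `a_position` are hypothetical names, and it assumes the rotation is constant for the whole draw call rather than per vertex:

```glsl
#version 330 core

// Hypothetical names; both uniforms are set by the application per draw call.
uniform bool  u_hasRotation;
uniform float u_rotation;

in vec2 a_position;

void main()
{
    vec2 p = a_position;

    // The condition is identical for every invocation in the draw call,
    // so no invocations diverge at this branch.
    if (u_hasRotation)
    {
        float s = sin(u_rotation);
        float c = cos(u_rotation);
        p = vec2(c * p.x - s * p.y, s * p.x + c * p.y);
    }

    gl_Position = vec4(p, 0.0, 1.0);
}
```

Whether the branch is actually free still depends on the driver and hardware, so profile before relying on it.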
2

Not sure if this answers your question, but it's very difficult to tell you how many clocks/slots an instruction takes, as it depends very much on the GPU. Usually it's a single cycle. But even if not, the compiler may rearrange the order of instruction execution to hide the true cost. It's certainly slower to use texture lookups for sin/cos than it is to execute the instructions.

Robinson
  • I don't see any mention of sincos() in the spec http://www.opengl.org/registry/doc/GLSLangSpec.Full.1.40.05.pdf what is the actual function name? Is that an extension? – ulmangt Apr 14 '12 at 19:22
  • My apologies, actually I think that might be D3D only, and even then I think the compiler implicitly generates a sin and a cos instruction for it. – Robinson Apr 14 '12 at 19:30
  • FWIW, there's an ARB Fragment instruction `SCS` which returns sine(input.x) in the x component and cos(input.x) in the y component. – user1118321 Apr 14 '12 at 23:17
1

See how many sin calls you can chain in a row in one shader, compared to abs, frac, etc. I think a GTX 470 can handle 200 sin calls per fragment with no problems; the frame will be about 10 percent slower than with an empty shader. It's fairly fast. You can send results in. It will be a good indicator of computational efficiency.
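A test shader along those lines could look like the sketch below. `SIN_COUNT` and `fragColor` are names chosen for this example, and the value 200 is simply the figure quoted above, not a measured result:

```glsl
#version 330 core

// SIN_COUNT is an assumed value; vary it and compare frame times.
const int SIN_COUNT = 200;

out vec4 fragColor;

void main()
{
    // Seed from the fragment position so the result is not a compile-time constant.
    float x = gl_FragCoord.x * 0.001;

    for (int i = 0; i < SIN_COUNT; ++i)
    {
        // Each call depends on the previous result, which makes the chain
        // hard for the compiler to collapse or reorder away.
        x = sin(x + float(i));
    }

    // Writing the result out keeps the loop from being removed as dead code.
    fragColor = vec4(vec3(x * 0.5 + 0.5), 1.0);
}
```

Swapping sin() for another built-in in the loop body and timing a full-screen draw gives a rough relative comparison on one particular GPU and driver.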

bandybabboon
-1

The compiler evaluates both branches, which makes conditions quite expensive. If you use both sin and cos in your shader, you can calculate only sin(a) and then derive cos(a) = sqrt(1.0 - sin(a)*sin(a)), up to sign, since sin(x)*sin(x) + cos(x)*cos(x) is always 1.0.
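Written out with the Pythagorean identity sin(a)*sin(a) + cos(a)*cos(a) = 1, the idea looks roughly like this. `sinCosViaIdentity` is a hypothetical helper, and note that sqrt() discards the sign, so the result is only correct where cos(a) is non-negative:

```glsl
// Hypothetical helper; only valid where cos(a) >= 0, i.e. a in [-pi/2, pi/2]
// (plus multiples of 2*pi), because sqrt() discards the sign.
vec2 sinCosViaIdentity(float a)
{
    float s = sin(a);
    // max() guards against a tiny negative argument caused by rounding.
    float c = sqrt(max(0.0, 1.0 - s * s));
    return vec2(s, c);
}
```

This trades a cos() for a multiply, a subtract and a sqrt(), which is not obviously cheaper, so it would need to be measured.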

  • sin(x) + cos(x) is not generally 1.0. You're probably thinking of the identity that sin(x) * sin(x) + cos(x) * cos(x) is 1.0. While that identity can be used to calculate one value from the other, this involves a square root, which is probably just as expensive as calculating the value. So it's not really useful. Also, modern GPUs don't typically evaluate both branches as long as the condition values are the same for all fragment values that are processed together. – Reto Koradi Aug 15 '15 at 03:57
  • Yes, I was thinking of cos^2(x)+ sin^2(x) = 1 from Pythagoras' theorem. My bad. – unknownFellow Sep 26 '15 at 09:31