On recent Pentium microarchitectures, a sqrt has a latency of 10-22 cycles (compared to 3 cycles for an FP add, 5 cycles for an FP multiply and 2-4 cycles for an FP-to-int type conversion). The cost is significantly higher, especially as sqrt is not pipelined and a new sqrt can only be started every 5 cycles.
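If you want to get a feel for these numbers on your own machine, a rough (and deliberately naive) way is to time a dependent chain of sqrt calls, so each one has to wait for the previous result and you observe latency rather than throughput. This is only a sketch, not a rigorous benchmark; the iteration count and the use of clock_gettime are arbitrary choices, and the add in the chain slightly inflates the per-sqrt figure:

```c
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long N = 100000000L;
    double x = 1.234567;          /* x = sqrt(x + 1) converges, no overflow */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        x = sqrt(x + 1.0);        /* each iteration depends on the previous one */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("result %g, ~%.1f ns per dependent sqrt (+ one add)\n", x, secs / N * 1e9);
    return 0;
}
```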
But adding a test may not be a good idea, as the test also has a cost that must be considered. In modern processors with deep pipelines, instructions are fetched in advance to fill the pipeline, and a branch may force all of these fetched instructions to be discarded. To limit this nasty effect, processors try to "predict" the behavior of tests: is the branch taken or not, and what is the target address? Prediction is based on the regularity of the program's behavior. Present predictors are very good, and for many problems a properly predicted branch does not have a significant cost.
But predictions can fail, and a misprediction costs 15-20 cycles, which is very high.
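For concreteness, here is a minimal sketch of the two variants being compared; the names value, plain and guarded are hypothetical, assuming the input is a plain double:

```c
#include <math.h>

/* Unconditional version: always pays the sqrt latency. */
double plain(double value) {
    return sqrt(value);
}

/* Guarded version: a test skips the sqrt when value is exactly 1.0,
 * at the price of a branch that the processor must predict. */
double guarded(double value) {
    if (value == 1.0)
        return 1.0;
    return sqrt(value);
}
```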
Now let us try to evaluate roughly what the gain of the modification you propose would be. We can consider several scenarios.
90% of the time the value is != 1.0 and 10% of the time it is equal to 1.0. Based on this behavior, branch predictors will bet that you do not take the branch (value != 1.0).
So 90% of the time you have a normal sqrt to compute (and the test cost is negligible), and 10% of the time you have a misprediction. You avoid the 10-22 cycle sqrt, but you pay a ~15-cycle branch penalty. The gain is roughly null.
90% of the time the value is equal to 1.0 and 10% of the time it is different. Branch predictors will assume that you take the branch.
When the value is 1.0, you have a clear win and the cost is almost null. 10% of the time you pay a branch misprediction plus a sqrt. Compared to 100% sqrt, on average there is a win.
50% of values are 1.0 and 50% are different. This is more or less a disaster scenario. Branch predictors will have great difficulty finding a clear pattern in the branch behavior and may fail a significant fraction of the time, say 40%, up to 100% if you are very unlucky. You will add many branch mispredictions to your computational cost, and you may end up with a negative gain!
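As a back-of-envelope check of the three scenarios above, here is a tiny expected-cost calculation. The cycle counts (15 cycles for a sqrt, 15 for a misprediction, 1 for a well-predicted test) are assumed round numbers for illustration, not measured values, and the misprediction rates are guesses consistent with the discussion:

```c
#include <stdio.h>

/* Rough per-element cost model (in cycles), under assumed round numbers:
 *   sqrt ~15 cycles, branch mispredict ~15 cycles, predicted test ~1 cycle.
 * p_one is the fraction of inputs equal to 1.0; mispredict_rate is how
 * often the predictor gets the branch wrong for that input distribution. */
static double guarded_cost(double p_one, double mispredict_rate) {
    const double SQRT = 15.0, MISS = 15.0, TEST = 1.0;
    return TEST + (1.0 - p_one) * SQRT + mispredict_rate * MISS;
}

int main(void) {
    const double SQRT = 15.0;  /* cost of the unconditional version */
    printf("always sqrt      : %.1f cycles\n", SQRT);
    printf("10%% ones         : %.1f cycles\n", guarded_cost(0.10, 0.10));
    printf("90%% ones         : %.1f cycles\n", guarded_cost(0.90, 0.10));
    printf("50%% ones (random): %.1f cycles\n", guarded_cost(0.50, 0.50));
    return 0;
}
```

With these assumed numbers it prints about 16, 4 and 16 cycles for the three guarded cases versus 15 for the unconditional sqrt, which matches the qualitative conclusions above: no gain, a clear win, and a slight slowdown respectively.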
These estimations are very rough and would require a finer computation with a model of your data, but except perhaps when a large part of your data is equal to 1.0, you will at best see no gain, and you may even see a slowdown.
You can find measurements of the cost of operations on Agner Fog's site: https://www.agner.org/optimize