On recent Pentium microarchitectures, a sqrt has a latency of 10-22 cycles (compared to 3 cycles for an FP add, 5 cycles for an FP multiply and 2-4 cycles for an FP-to-int type conversion). The cost is significantly higher, especially as sqrt is not pipelined and a new sqrt can only be started every 5 cycles.
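If you want to get a feel for these numbers on your own machine, a rough (and deliberately naive) way is to time a dependent chain of sqrt calls, so each one has to wait for the previous result and you observe latency rather than throughput. This is only a sketch, not a rigorous benchmark; the iteration count and the use of clock_gettime are arbitrary choices, and the add in the chain slightly inflates the per-sqrt figure:

```c
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    const long N = 100000000L;
    double x = 1.234567;          /* x = sqrt(x + 1) converges, no overflow */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        x = sqrt(x + 1.0);        /* each iteration depends on the previous one */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("result %g, ~%.1f ns per dependent sqrt (+ one add)\n", x, secs / N * 1e9);
    return 0;
}
```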
But adding a test may not be a good idea, as the test also has a cost that must be considered. In modern processors with deep pipelines, instructions are fetched in advance to fill the pipeline, and a branch may force all of these fetched instructions to be discarded. To limit this nasty effect, processors try to "predict" the behavior of tests: is the branch taken or not, and what is the target address? Prediction is based on the regularity of the program's behavior. Present predictors are very good, and for many problems a properly predicted branch does not have a significant cost.
But predictions can fail, and a misprediction costs 15-20 cycles, which is very high.
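For concreteness, here is a minimal sketch of the two variants being compared; the names value, plain and guarded are hypothetical, assuming the input is a plain double:

```c
#include <math.h>

/* Unconditional version: always pays the sqrt latency. */
double plain(double value) {
    return sqrt(value);
}

/* Guarded version: a test skips the sqrt when value is exactly 1.0,
 * at the price of a branch that the processor must predict. */
double guarded(double value) {
    if (value == 1.0)
        return 1.0;
    return sqrt(value);
}
```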
Now let us try to evaluate roughly what the gain of the modification you propose would be. We can consider several scenarios.
90% of the time the value is != 1.0 and 10% of the time it is equal to 1.0. Based on this behavior, branch predictors will bet that you do not take the branch (value != 1.0).
So 90% of the time you have a normal sqrt to compute (and the test cost is negligible), and 10% of the time you have a misprediction. You avoid the 10-22 cycle sqrt, but you pay a ~15-cycle branch penalty. The gain is roughly null.
90% of the time the value is equal to 1.0 and 10% of the time it is different. Branch predictors will assume that you take the branch.
When the value is 1.0, you have a clear win and the cost is almost null. 10% of the time you pay a branch misprediction plus a sqrt. Compared to 100% sqrt, on average there is a win.
50% of values are 1.0 and 50% are different. This is more or less a disaster scenario. Branch predictors will have great difficulty finding a clear pattern in the branch behavior and may fail a significant fraction of the time, say 40%, up to 100% if you are very unlucky. You will add many branch mispredictions to your computational cost, and you may end up with a negative gain!
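As a back-of-envelope check of the three scenarios above, here is a tiny expected-cost calculation. The cycle counts (15 cycles for a sqrt, 15 for a misprediction, 1 for a well-predicted test) are assumed round numbers for illustration, not measured values, and the misprediction rates are guesses consistent with the discussion:

```c
#include <stdio.h>

/* Rough per-element cost model (in cycles), under assumed round numbers:
 *   sqrt ~15 cycles, branch mispredict ~15 cycles, predicted test ~1 cycle.
 * p_one is the fraction of inputs equal to 1.0; mispredict_rate is how
 * often the predictor gets the branch wrong for that input distribution. */
static double guarded_cost(double p_one, double mispredict_rate) {
    const double SQRT = 15.0, MISS = 15.0, TEST = 1.0;
    return TEST + (1.0 - p_one) * SQRT + mispredict_rate * MISS;
}

int main(void) {
    const double SQRT = 15.0;  /* cost of the unconditional version */
    printf("always sqrt      : %.1f cycles\n", SQRT);
    printf("10%% ones         : %.1f cycles\n", guarded_cost(0.10, 0.10));
    printf("90%% ones         : %.1f cycles\n", guarded_cost(0.90, 0.10));
    printf("50%% ones (random): %.1f cycles\n", guarded_cost(0.50, 0.50));
    return 0;
}
```

With these assumed numbers it prints about 16, 4 and 16 cycles for the three guarded cases versus 15 for the unconditional sqrt, which matches the qualitative conclusions above: no gain, a clear win, and a slight slowdown respectively.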
These estimations are very rough and would require a finer computation with a model of your data, but except perhaps when a large part of your data is equal to 1.0, you will at best see no gain, and you may even see a slowdown.
You can find measurements of the cost of operations on Agner Fog's site: https://www.agner.org/optimize