
What is the difference in CPU cycles (or, in essence, in 'speed') between

 x /= y;

and

 #include <cmath>
 x = sqrt(y);

EDIT: I know the operations aren't equivalent; I'm just arbitrarily proposing `x /= y` as a benchmark for `x = sqrt(y)`.

Charles
Matt Munson
  • What....? How are these related? – duffymo Jul 30 '11 at 16:11
  • It highly depends on compiler, configuration and target CPU. – Yakov Galka Jul 30 '11 at 16:12
  • Huh? Those are entirely different operations. What's the difference between `printf` and `new`? In any case, just compare the assembly! – Kerrek SB Jul 30 '11 at 16:12
  • I imagine the performance of `sqrt` is dependent at least partly on the value of `x` – Cameron Jul 30 '11 at 16:13
  • Even with `y`, these snippets are still completely different. – Cameron Jul 30 '11 at 16:14
  • While comparing two different operations may sound strange, it is definitely possible (even if platform dependent and quite difficult to do right). Knowing the approximate relative speed of basic floating-point operations is important when doing low-level optimizations. Sometimes you can solve the same problem, e.g. (artificial example), either by multiplying 4 times and dividing 3 times, or by multiplying 2 times and performing square root 2 times. – Suma Jul 30 '11 at 16:20
  • @Mat a point of reference. If I have experience with the speed of one thing (because it is ubiquitous), then I can assess the speed of another thing when I know its speed relative to the first. – Matt Munson Jul 30 '11 at 16:21
  • @Suma: You cannot compare the two unless you're given a concrete implementation of the division and the sqrt. Even if both are translated to x86 machine code, their performance may be different on different processors (e.g. old vs new). – Yakov Galka Jul 30 '11 at 16:22
  • Guys, while not completely clear, I believe this to be a real question. @Matt: on less powerful systems that don't have dedicated hardware, sqrt is generally 10x slower than div. On any hardware from this decade, they are very close, and quite often get pipelined together into similar floating-point performance. You can search for CPU timings on your particular processor to get a better feel. – Michael Dorgan Jul 30 '11 at 16:25
  • Here http://www.friweb.hu/instlatx64/ you can find measured timings of all x86 instructions (ns and ticks). E.g. for a Core 2 Duo E6700, the latency (L) of the x87 sqrt operation is 29 ticks for 32-bit float, 58 ticks for 64-bit double and 69 ticks for 80-bit long double; the SSE/SSE2 times for 32/64-bit packed floating point are the same (29 and 58 ticks). For F.P. divide: 32-bit = 18 ticks, 64-bit = 32 ticks, 80-bit = 38 ticks; the 32/64-bit timings are the same for x87 and SSE/SSE2. In your operation there is also loading and storing a value, which must be counted additionally. This should be The answer, but some closed this good Q. – osgx Jul 30 '11 at 16:27
  • @ybun I'm sorry if the question is ambiguous, but I'm just trying to get a general sense of how much sqrt() may slow down my code. – Matt Munson Jul 30 '11 at 16:28
  • That last comment doesn't really make sense. If you need a square root, you'll have to pay the cost of calculating it. `sqrt` doesn't "slow down your code". Whether `sqrt` will be "fast" or not depends on your compiler & your platform. – Mat Jul 30 '11 at 16:31
  • @Mat But in some situations calculating a square root can be avoided. – Matt Munson Jul 30 '11 at 16:34
  • This is a ridiculous question - there's no data or basis for it whatsoever. It's far more likely that the rest of Matt's code will be sinking him, not square root. I'd vote to close. – duffymo Jul 30 '11 at 20:36

3 Answers


The answer to your question depends on your target platform. Assuming you are using a common x86 CPU, I can give you this link, http://instlatx64.atw.hu/, a collection of measured instruction latencies (how long it takes the CPU to produce a result once it has the arguments) and pipelining behaviour for many x86 and x86_64 processors. If your target is not x86, you can try to measure the cost yourself or consult your CPU's documentation.

First, you should get a disassembly of your operations (from the compiler, e.g. with gcc: gcc file.c -O3 -S -o file.asm, or by disassembling the compiled binary, e.g. with the help of a debugger). Remember that your operation also loads and stores a value, which must be counted additionally.
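
For example (my own minimal sketch, not part of the original answer; the file and function names are made up), you can put each operation into a tiny function, compile it to assembly, and look at what the compiler actually emits:

    // sqrt_vs_div.cpp
    // Inspect the generated assembly with: g++ -O3 -S sqrt_vs_div.cpp -o sqrt_vs_div.asm
    #include <cmath>

    // Expect a single divsd (SSE2) or fdiv (x87) instruction here.
    double do_div(double x, double y) { return x / y; }

    // Expect a sqrtsd, possibly plus a branch to the library call for errno handling;
    // -fno-math-errno or -ffast-math usually leaves a bare sqrtsd.
    double do_sqrt(double y) { return std::sqrt(y); }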

Here are two examples from friweb.hu:

For a Core 2 Duo E6700, the latency (L) of SQRT (x87, SSE and SSE2 versions):

  • 29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double;

and of DIVIDE (floating-point):

  • 18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit

For newer processors, the cost is lower and is almost the same for DIV and SQRT, e.g. for an Intel Sandy Bridge CPU:

Floating-point SQRT is

  • 14 ticks for 32 bit; 21 ticks for 64 bit; 24 ticks for 80 bit

Floating-point DIVIDE is

  • 14 ticks for 32 bit; 22 ticks for 64 bit; 24 ticks for 80 bit

SQRT is even a tick faster in the 64-bit case.

So: on older CPUs, division is roughly 30-50% faster than sqrt; on newer CPUs the cost is the same, and the cost of both operations is lower than it was on older CPUs. Longer floating-point formats need more time; e.g. 64-bit takes about twice as long as 32-bit, but 80-bit is cheap compared with 64-bit.
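
If you want to sanity-check numbers like these on your own machine, a rough latency micro-benchmark is easy to write. The sketch below is mine, not taken from the instlatx64 tables; it feeds each result into the next input so the operations cannot overlap, which approximates latency rather than throughput:

    #include <chrono>
    #include <cmath>
    #include <cstdio>

    int main() {
        const int N = 100000000;
        volatile double seed = 1.000000001;   // volatile so the loops cannot be constant-folded
        double x = seed;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) x = std::sqrt(x + 1.0);   // dependent sqrt chain
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) x = 1.0 / (x + 1.0);      // dependent divide chain
        auto t2 = std::chrono::steady_clock::now();

        double s_sqrt = std::chrono::duration<double>(t1 - t0).count();  // seconds
        double s_div  = std::chrono::duration<double>(t2 - t1).count();
        std::printf("sqrt: %.2f ns/op, div: %.2f ns/op (result %g)\n",
                    1e9 * s_sqrt / N, 1e9 * s_div / N, x);
    }

Compile with optimizations but without -ffast-math (e.g. g++ -O2 bench.cpp), multiply by your clock frequency to convert to ticks, and treat the output as ballpark figures only; the added `x + 1.0` keeps the values in range and costs the same in both loops.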

Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as their scalar (x87) counterparts. Vectors hold 2-4 values of the same type. If you can arrange your loop to apply the same operation to several FP values at once, you will get more performance out of the CPU.
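
As a minimal illustration of that last point (my own sketch, x86-specific), the packed SQRTPS instruction computes four single-precision square roots at once; a loop written like this, or a plain loop that the compiler auto-vectorizes at -O3, gets several square roots for roughly the price of one:

    #include <cmath>        // std::sqrt for the scalar tail
    #include <xmmintrin.h>  // SSE intrinsics (x86 only)

    void sqrt_array(const float* in, float* out, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4)                                       // 4 floats per step
            _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));   // one SQRTPS = 4 square roots
        for (; i < n; ++i)                                               // leftover elements
            out[i] = std::sqrt(in[i]);
    }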

Paolo M
osgx
  • I'm sure it's implied, but I'm assuming that sqrt takes advantage of these CPU optimizations? – Matt Munson Jul 30 '11 at 17:00
  • C++ `cmath` uses the same `sqrt()` as the C version in `math.h`. But internally `sqrt()` may do a bit more than just the `FSQRT` asm instruction, e.g. error handling. Also, sometimes gcc will not inline `sqrt()` at the call site, so the overhead of a function call will be there. You need to check the disassembly of YOUR function and grep it for machine instructions with "sqrt" in their names. Also try the option `-ffast-math`. – osgx Jul 30 '11 at 21:05

If the square root function isn't implemented in special hardware or software, most library functions would calculate it using Newton's method, which converges quadratically.

Newton's method is an iterative method: you make an initial guess, calculate a trial result, and use that result for the next guess. You repeat until you have a result that's "close enough." For square root you can prove how many iterations you need: because the method converges quadratically, each pass through the loop roughly doubles the number of correct digits, so most implementations reach the precision limit of a double within a handful of iterations, 8-9 at most even from a rough starting guess.

If you work through the iteration, you'll see that Newton's method is doing two subtractions, one multiplication, and one division per pass.
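
Purely as an illustration (my sketch, not how any real library implements sqrt; production code starts from a hardware estimate or a table lookup and runs a fixed, small number of steps), the iteration might look like this, written in the compact (x + y/x)/2 form that rearranges those subtractions into an addition:

    #include <cassert>
    #include <cmath>   // std::frexp, std::ldexp, std::fabs

    // Newton's method on f(x) = x*x - y:  x_{n+1} = (x_n + y/x_n) / 2
    // Illustrative only: no handling of zero, negatives, infinities or NaN.
    double newton_sqrt(double y) {
        assert(y > 0.0);
        int e;
        std::frexp(y, &e);                          // y = m * 2^e with m in [0.5, 1)
        double x = std::ldexp(1.0, e / 2);          // rough guess ~ 2^(e/2), within ~2x of sqrt(y)
        for (int i = 0; i < 9; ++i) {               // quadratic convergence: a handful of steps suffice
            double next = 0.5 * (x + y / x);        // one add, one multiply, one divide per pass
            if (std::fabs(next - x) <= 1e-15 * x)   // stop once the update drops below double precision
                return next;
            x = next;
        }
        return x;
    }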

duffymo
  • Could you explain "converges quadratically"? – Kerrek SB Jul 30 '11 at 16:13
  • @duffymo So is SQRT implemented using Newton's method, or does it take advantage of the CPU optimizations that others have mentioned? – Matt Munson Jul 30 '11 at 16:58
  • This question is numerical methods. It belongs here. @Matt, I don't know about your particular implementation. Your C++ compiler might insert instructions for a machine-optimized version. – duffymo Jul 30 '11 at 17:30
  • @duffy oh wow, so that's ~8 * (2 subs., 1 mult., and one div.). Unless the CPU is optimizing considerably beyond that, that's more computation than I was hoping for. Thanks. – Matt Munson Jul 30 '11 at 20:30
  • You're being foolish. This is a micro-optimization. Do you have any data to suggest that the majority of your application's time is being spent calculating square roots? If not, forget about it until you do. I doubt that you have any data whatsoever to motivate this question. – duffymo Jul 30 '11 at 20:35
  • @duffy Well, the square roots are being calculated recursively. The application is in neural networks, so the program just runs the same loop repeatedly, feeding new input to the network at each iteration. Sure this is micro-optimization, but if I micro-optimize every aspect of the algorithm, which is practical given the relative shortness of the code, I think it could lead to worthwhile gains. – Matt Munson Jul 31 '11 at 04:48
  • The key there is "I think". Measure it - profile your code and be sure. You might be surprised by the outcome. – duffymo Jul 31 '11 at 14:51
  • @duffy What's the best way to do that? If I run the program for long enough, can I just measure how many iterations it completes in x amount of time? Or will I need some particular kind of software to actually measure the compute time? – Matt Munson Jul 31 '11 at 16:16
  • @KerrekSB Quadratic convergence means, roughly, that with each iteration the number of digits of accuracy doubles. E.g., iteration 1 has error 0.1, iteration 2 has error 0.01, iteration 3 has error 0.001, iteration 4 has error 0.00001, iteration 5 has error 0.000000001. – Nick Alger Jun 27 '16 at 09:18

As a general rule of thumb: both floating-point division and square root are considered slow operations (compared to fast ones like addition or multiplication). Square root can be expected to be approximately the same speed as, or somewhat slower than, a division (roughly 1x-2x lower performance). For example, on the Pentium Pro:

Division and square root have a latency of 18 to 36 and 29 to 69 cycles, respectively

To get a more accurate answer, you need to dig into the architecture manual for your platform or perform a benchmark.

Note: many modern platforms also offer an inverse square root, which is approximately the same speed as sqrt but is often more useful (e.g. once you have invsqrt you can compute both sqrt and the reciprocal with one multiplication each).
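
As a sketch of that trick (mine, not part of this answer; x86-specific, and the hardware estimate is only good to about 12 bits, so one Newton refinement step is added), using the SSE RSQRTSS instruction:

    #include <immintrin.h>  // x86 SIMD intrinsics

    // Approximate 1/sqrt(y) with the hardware RSQRTSS estimate (~12 bits of
    // precision), then refine with one Newton-Raphson step. Assumes y > 0.
    static inline float fast_rsqrt(float y) {
        float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(y)));
        return r * (1.5f - 0.5f * y * r * r);   // one refinement step
    }

    // With invsqrt available, each of these costs only one extra multiply:
    static inline float fast_sqrt(float y)  { return y * fast_rsqrt(y); }              // sqrt(y) = y * 1/sqrt(y)
    static inline float fast_recip(float y) { float r = fast_rsqrt(y); return r * r; } // 1/y = (1/sqrt(y))^2

For a full division x/y you would multiply once more by x, and whether any of this beats the hardware DIVSS/SQRTSS depends on the CPU and the precision you need, so measure before relying on it.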

Suma
  • For Sandy Bridge from Intel both operations take exactly the same time. So, now, sqrt is not 2x slower than div. – osgx Jul 30 '11 at 16:36
  • OK. Adjusted. It would be possible to include exact timings for many platforms, but I think the question wants just a "gut feeling"; in the rare situations where you really need exact data, it is more important to know where or how you can get it. – Suma Jul 30 '11 at 16:47
  • Two exact examples give some feeling to me. – osgx Jul 30 '11 at 16:50