1,000,000,000 calculations per microsecond?

Question

OK, I've been talking to a friend about compilers and optimisation of programs, and he suggested that n * 0.5 is faster than n / 2. I said that compilers do that kind of optimisation automatically, so I wrote a small program to see if there was a difference between n / 2 and n * 0.5:

Division:

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i / 2;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);

    return 0;
}

Multiplication:

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i * 0.5;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);

    return 0;
}

And for both versions I got 0.000002s avg. when compiled with clang main.c -O1. And he said there must be something wrong with the time measurement. So he then wrote a program:

#include <cstdio>
#include <iostream>
#include <ctime>

using namespace std;

int main()
{
    clock_t ts, te;
    double  dT;

    int i, m;
    double n, o, p, q, r, s;
    m = 1000000000;

    cout << "Independent calculations:\n";
    ts = clock();
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = 22.2 / 2.3;
        p = 33.3 / 2.3;
        q = 44.4 / 2.3;
        r = 55.5 / 2.3;
        s = 66.6 / 2.3;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = 22.2 * 0.53;
        p = 33.3 * 0.53;
        q = 44.4 * 0.53;
        r = 55.5 * 0.53;
        s = 66.6 * 0.53;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    cout << "\nDependent calculations:\n";
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = n / 2.3;
        p = o / 2.3;
        q = p / 2.3;
        r = q / 2.3;
        s = r / 2.3;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = n * 0.53;
        p = o * 0.53;
        q = p * 0.53;
        r = q * 0.53;
        s = r * 0.53;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    return 0;
}

And for that he got...

...in that order.

So I ran the program on my machine compiled with clang++ main.cpp -O1 and I got similar results as before: 0.000002 to 0.000011.

However, when I compiled the program without optimisation, I got similar results to him on his first test. So my question is, how can any amount of optimisation make the program that much faster?

so there are some serious issues with your benchmark programs. That will break anything you're trying to test anyway. Most notably you're mixing types and have type coersion going on which is VERY expensive in and of itself (and could be what your friend was referring to). Also there is nothing here a compiler can't calculate at compile time. — Mgetz, Feb 18 '15 at 20:00
`n = i * 0.5;` promotes `i` to `double`, multplies by .5 and converts back to `float`. `n = i / 2;` divides (probably shifts) `i` by 2, then converts to double. You are not testing multiply vs. divide, but unrelated operations. — David Rodríguez - dribeas, Feb 18 '15 at 20:04
If you want your "benchmarks" to not get optimized away, try accumulating the result within each iteration in the `s` variable. Then print the final value of `s` after you've made the timing calculation. — Praetorian, Feb 18 '15 at 20:04
To add to David's correct point: even if you did `i / 2.0` and `i * 0.5`, the compiler is not (legally) allowed to optimize the divide into a multiplication. [Floating point operations need to be performed as specified](http://stackoverflow.com/questions/6430448/why-doesnt-gcc-optimize-aaaaaa-to-aaaaaa). (although you might be able to argue that `2.0` and `0.5` are both perfectly representable by IEEE-754, so the compiler *might* be able to do this optimization in this specific case, but in general it can't) — Cornstalks, Feb 18 '15 at 20:10
I seriously doubt you are getting *"1,000,000,000 calculations per microsecond!"* — Weather Vane, Feb 18 '15 at 20:11
to explain your friends attempt : multiplication is ALWAYS faster than division - BUT since modern hardware is massively optimized and has dedicated circuits for different purposes --- together with MANY OS-optimizations of various kind its pretty much impossible to make a statement about arithmetic speed, you can only measure it - and the results will never be general results, they will ALWAYS be specific ones, benchmarks for your current hardware & software. The exact same testing-method WILL yield different results on different machines, even contradicting ones. Its only theory, not reality — specializt, Feb 18 '15 at 20:17
@thephpdev: Don't completely change the question after you've gotten answers. Ask a new question; otherwise you invalidate the answers already given. — wolfPack88, Feb 18 '15 at 20:26
[Fixed sandbox app that explores the issue](http://coliru.stacked-crooked.com/a/ba74dc83a98bde77) — Mgetz, Feb 18 '15 at 20:39
i dont see how your edit changes the fact that digital multiplication is always faster than division if measured for the exact same, non-specialized, arithmetic circuit ... — specializt, Feb 18 '15 at 20:51
If you compile with `-ffast-math`, gcc will favor fast operations over strict adherence to the standard. Substituting division with a reciprocal approximation should be one such optimization. — Red Alert, Feb 18 '15 at 21:09

score 19 · Accepted Answer · answered Feb 18 '15 at 20:00

19

Since the code doesn't do anything differently in each iteration of the loop, the compiler is free to move the code within the loop outside (the result will be exactly the same), and remove the loop entirely, leaving you with an almost 0 run-time, as you saw.

answered Feb 18 '15 at 20:00

wolfPack88

4,163
4
32
47

Just to be clear, I think @wolfPack88 is saying basically all your code evaluates is '(float)((m-1)>>1)'. You should probably look at the asm. – dwn Feb 18 '15 at 20:11
2

@dwn: No, look at the third code block (the one he timed with and without optimization). He has a loop that calculates something 1000000 times, but the calculation is exactly the same. The compiler is smart enough to notice that, and calculate it only once (perhaps even at compile time), leaving the executable with very little to do. – wolfPack88 Feb 18 '15 at 20:14
Oh, then I think I disagree; the variable 'n' is never referenced in the loop, so it is probably just ignored. – dwn Feb 18 '15 at 20:17
@dwn: That's more or less what I said... the compiler is smart enough to recognize that things aren't being used, and it gets optimized away. That's how optimization can have such a large effect. – wolfPack88 Feb 18 '15 at 20:19
@woldPack88 Okay, sounds like we agree more or less – dwn Feb 18 '15 at 20:19
OK, it appears that was happening. – thephpdev Feb 18 '15 at 20:26

score 5 · Answer 2 · answered Feb 18 '15 at 20:04

for (i = 0; i < m; i++)
{
    //  make it a trivial pure float calculation with no int casting to float
    n = 11.1 * 0.53;
    o = n * 0.53;
    p = o * 0.53;
    q = p * 0.53;
    r = q * 0.53;
    s = r * 0.53;
}

is a loop that does not reference i or m and contains no circular references so it is trivial for the compiler to remove the looping statement

harold · Answer 3 · 2015-02-24T08:33:31.597

This is a good example of how benchmarking a high level language is even harder than benchmarking assembly (which is hard enough to get right already). The compiler can, and often will, interfere with your benchmark.

Your friend has a point, a division (actual division, not just writing / in C) is slower than a multiplication. For doubles, the ratio is about 4 for the latency, and division isn't pipelined whereas multiplication is, so the throughput ratio is much worse: around 20. (these numbers are for Haswell, but are typical)

Integer division is slower than floating point division, but using floating point multiplication on integer causes two conversions. The conversions aren't too bad, so in total, the floating point multiplication is still faster.

But any proper compiler will turn integer division by a constant into integer multiplication and a shift, and maybe some extra fix-up stuff (depending on the divisor and the type). Division by a power of two is even simpler. For details, see Division by Invariant Integers using Multiplication. As an example, consider

int div2(int i)
{
    return i / 2;
}

GCC turns this into

mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret

Which depending on the µarch, would take 3 or 4 cycles (excluding control flow).

On the other hand,

int div2(int i)
{
    return i * 0.5;
}

Is turned into

    cvtsi2sd    xmm0, edi
    mulsd   xmm0, QWORD PTR .LC0[rip]
    cvttsd2si   eax, xmm0
    ret
.LC0:
    .long   0
    .long   1071644672

Which would take about 4+5+4 = 13 cycles.

A compiler is also allowed to turn f / 2.0 into f * 0.5 even without any "unsafe optimizations", division by a power of two is equivalent to multiplication by its inverse. That does not hold for numbers that are not a power of two.

So even with a benchmark that at least measured something, it probably wouldn't have measured the right thing. In order to measure the latency floating point division vs multiplication, you could write something like:

.section data
align 16
one: dq 1.0, 1.0
.section text
_bench1:
  mov ecx, -10000000
  movapd xmm0, [one]
loopone:
  mulpd xmm0, xmm0
  mulpd xmm0, xmm0
  add ecx, 1
  jnz loopone
  ret
_bench2:
  mov ecx, -10000000
  movapd xmm0, [one]
looptwo:
  divpd xmm0, xmm0
  divpd xmm0, xmm0
  add ecx, 1
  jnz looptwo
  ret

Call both a thousand, wrapped in rdtsc to get the time. Take the lowest time for both. Multiply the time by the ratio between your base clock and turbo clock. Then you should have (approximately) the number of cycles both loops take, divide by 20000000 to get the number of cycles per mulpd or divpd. The time division takes depends on the values being divided, so it doesn't give the most general result.

You may also be interested in a list of instruction latencies and throughputs.

1,000,000,000 calculations per microsecond?

3 Answers3