
So I'm doing a little benchmark measuring operations per second for different operator/type combinations in C++, and now I'm stuck. My tests for +/int and +/float look like

    int int_plus(){
        int x = 1;
        for (int i = 0; i < NUM_ITERS; ++i){
            x += i;
        }
        return x;
    }

    float float_plus(){
        float x = 1.0;
        for (int i = 0; i < NUM_ITERS; ++i){
            x += i;
        }
        return x;
    }

And the time measurement looks like

    //same for float
    start = chrono::high_resolution_clock::now();
    int res_int = int_plus();  
    end = chrono::high_resolution_clock::now();
    diff = end - start;
    ops_per_sec = NUM_ITERS / (diff.count() / 1000);

When I run the tests I get

  1. 3.65606e+08 ops per second for int_plus
  2. 3.98838e+08 ops per second for float_plus

But as I understand it, float operations are always slower than int operations, yet my tests show a higher value for the float type. So here is the question: am I wrong, or is there something wrong with the code? Or maybe it's something else?

  • With what compiler options are you building the program? If you have optimizations active, then it is possible that the compiler will optimize away the loop. You may want to inspect the assembly output of the compiler, in order to determine whether the loop has been optimized away. – Andreas Wenzel Sep 15 '21 at 21:51
  • One way you can check if the compiler is optimizing away the loop entirely is to vary `NUM_ITERS`. If you make `NUM_ITERS` 10x as large, do the operations per second stay the same or do they go up by a factor of 10 because the loop got optimized out to a product and took the same amount of time but did "more operations"? – Nathan Pierson Sep 15 '21 at 21:52
  • In addition to the uselessness of benchmarking with optimizations off and the difficulty of benchmarking loops with easily predictable side effects when optimizations are on, you don't have a floating-point version here. You have an `int+int` version and a `float+int` version. – Ben Voigt Sep 15 '21 at 21:52
  • There are probably more instructions involved in controlling the loop - incrementing, checking the boundary - than in the body of the loop. – Nolan Sep 15 '21 at 21:53
  • Also be careful not to make `NUM_ITERS` so large that you get `int` overflow (which, for a signed integral type, is undefined behavior). – Ben Voigt Sep 15 '21 at 21:55
  • https://stackoverflow.com/questions/2550281/floating-point-vs-integer-calculations-on-modern-hardware – crashmstr Sep 15 '21 at 21:55
  • Proper benchmarking is hard, and if you're not using the google benchmark library https://github.com/google/benchmark you're almost certainly doing it wrong. You can even try it online at https://quick-bench.com/ – xaxxon Sep 15 '21 at 22:08
  • Note: you are measuring the time to call the function, time for return from function call, variable allocation, and variable initialization. You may want to narrow the measurement to only the `for` loops. – Thomas Matthews Sep 15 '21 at 22:12
  • @xaxxon Proper benchmarking is hard, but I doubt Google is the only company to get it right. Your comment is a bit of a stretch. I did proper benchmarking in the past, with something different. :-) – Jeffrey Sep 15 '21 at 22:18
  • Off-topic, perhaps, but (IIRC) on x86/x64 systems, `FDIV` is an inherently faster instruction than `IDIV`. – Adrian Mole Sep 15 '21 at 23:00
  • If this optimized into scalar asm that looked like that, but kept variables in registers, you'd expect the FP version to take 4x as many clock cycles on CPUs like Skylake, with 4 cycle latency `addsd` vs. 1 cycle latency integer `add`. If you didn't enable *any* optimization, everything is getting stored/reloaded to memory and there are different bottlenecks (although you'd still expect integer to be a shorter dep chain). – Peter Cordes Sep 15 '21 at 23:53
  • Of course, clock speed isn't constant, especially if you didn't do a warm-up run before the timed region. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). Without details on your actual benchmark test harness and build options, we can't tell you what you did wrong here. (e.g. if you tested int first, then float, clock speed could have ramped up for the float test.) 3.6e8 ops/sec is pretty terrible for a CPU at 3 or 4e9 Hz, that's like one per 10 clocks. So for int it's likely the CPU wasn't at full clock speed. – Peter Cordes Sep 15 '21 at 23:54
  • Also note that clang/LLVM's optimizer can recognize this sum(i=0..n) pattern and use a variation on Gauss's closed-form formula, turning the loop into a couple shifts and multiplies. See Matt Godbolt's CppCon2017 talk “[What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid](https://youtu.be/bSkpMdDe4g4)” on youtube for an example, and [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) – Peter Cordes Sep 15 '21 at 23:56
  • (And BTW, the implicit int->float conversion is independent of the addition dependency chain, so out-of-order exec can hide it easily.) – Peter Cordes Sep 16 '21 at 00:00
  • @Jeffrey rules are only rules until you know why you can break them safely. Something doesn't have to be 100% truthful for everyone to be highly useful as a lie. In this case, it's true. – xaxxon Sep 16 '21 at 00:35
  • @NathanPierson I tried making `NUM_ITERS` 10x as large and operations per sec went up by ~ 8.0924e+07 – dmitrenkov Sep 16 '21 at 08:57
  • What do you mean by "my tests show greater value on float type"? If the time of execution is greater for float, that's expected. –  Sep 22 '21 at 15:04
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Sep 22 '21 at 15:05
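
Following the google/benchmark suggestion in the comments, a minimal harness might look like the sketch below. This is only a sketch: it assumes the `int_plus`/`float_plus` functions from the question are visible and that the library is installed, and `BM_IntPlus`/`BM_FloatPlus` are just illustrative names. `benchmark::DoNotOptimize` keeps the calls from being optimized away.

    #include <benchmark/benchmark.h>

    // Assumes int_plus() and float_plus() from the question are declared above.
    static void BM_IntPlus(benchmark::State& state) {
        for (auto _ : state) {
            benchmark::DoNotOptimize(int_plus());   // keep the result "used"
        }
    }
    BENCHMARK(BM_IntPlus);

    static void BM_FloatPlus(benchmark::State& state) {
        for (auto _ : state) {
            benchmark::DoNotOptimize(float_plus());
        }
    }
    BENCHMARK(BM_FloatPlus);

    BENCHMARK_MAIN();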

1 Answer


There are a few things that could be going on. Optimization can be part of it. Using a `#define` constant for `NUM_ITERS` also means the compiler knows the value at compile time and could do who-knows-what with the loop.
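
One way to rule that out is to make the iteration count unknown at compile time, for example by reading it at run time. A rough sketch, where `num_iters` and the `argv` handling are just illustrative (and, as noted in the comments, clang may still replace the loop with a closed-form expression):

    #include <cstdlib>

    // Sketch: take the iteration count as a parameter instead of a #define,
    // so it is not a compile-time constant the optimizer can fold away.
    int int_plus(int num_iters) {
        int x = 1;
        for (int i = 0; i < num_iters; ++i) {
            x += i;
        }
        return x;
    }

    int main(int argc, char** argv) {
        int num_iters = (argc > 1) ? std::atoi(argv[1]) : 1000000;
        int res = int_plus(num_iters);
        return res == 42 ? 1 : 0;   // use the result so the call isn't discarded
    }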

Note that the loop-control code is also being counted. That's the same for both loops, but it's part of your measured time, and it means each iteration does several int operations (increment, compare, branch) on top of the one addition, so you aren't measuring just 1 * NUM_ITERS operations.

If NUM_ITERS is relatively small, then the execution time is going to be very low, and the overhead of a function call probably dwarfs the cost of the operations inside it.
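
If you keep the hand-rolled timing, one way to reduce the impact of that overhead is to time many calls and divide, roughly like this sketch (it assumes `int_plus()` and `NUM_ITERS` from the question; `REPS`, `sink`, and `time_int_plus` are names introduced here, and `steady_clock` is used because it is monotonic):

    #include <chrono>
    #include <cstdio>

    // Sketch: amortize call and setup overhead over many calls.
    void time_int_plus() {
        constexpr int REPS = 1000;
        auto start = std::chrono::steady_clock::now();
        long long sink = 0;
        for (int r = 0; r < REPS; ++r) {
            sink += int_plus();          // REPS * NUM_ITERS additions in total
        }
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> secs = end - start;
        double ops_per_sec = (double(REPS) * NUM_ITERS) / secs.count();
        // Print sink as well so the summing loop can't be treated as dead code.
        std::printf("%g ops/sec (sink=%lld)\n", ops_per_sec, sink);
    }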

Optimization level will also matter.

I'm not sure what else.

Joseph Larson