I wrote a small benchmark to check how much faster or slower adding ints is compared to adding longs. My assumption was that int would be faster, since on a 64-bit CPU two of them fit in one register (in contrast to a 64-bit-wide long). To my surprise, they perform more or less the same.

Most surprising of all, adding ints and returning a long was the fastest variant on my machine (a MacBook with an M1 Pro, so an ARM chip).

```csharp
private const int Iterations = 1_000_000;

[Benchmark(Baseline = true)]
[Arguments(10, 20, 30)]
public int AddIntReturnInt(int a, int b, int c)
{
    int result = 0;
    for (var i = 0; i < Iterations; i++)
        result += a + b + c;

    return result;
}

[Benchmark]
[Arguments(10, 20, 30)]
public long AddIntReturnLong(int a, int b, int c)
{
    long result = 0;
    for (var i = 0; i < Iterations; i++)
        result += a + b + c;

    return result;
}

[Benchmark]
[Arguments(10L, 20L, 30L)]
public long AddLongReturnLong(long a, long b, long c)
{
    long result = 0;
    for (var i = 0; i < Iterations; i++)
        result += a + b + c;

    return result;
}
```

Results:

BenchmarkDotNet=v0.13.2, OS=macOS Monterey 12.6.1 (21G217) [Darwin 21.6.0]
Apple M1 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 7.0.0 (7.0.22.51805), Arm64 RyuJIT AdvSIMD


|            Method |  a |  b |  c |     Mean |   Error |  StdDev | Ratio |
|------------------ |--- |--- |--- |---------:|--------:|--------:|------:|
|   AddIntReturnInt | 10 | 20 | 30 | 935.5 us | 2.09 us | 1.95 us |  1.00 |
|  AddIntReturnLong | 10 | 20 | 30 | 318.1 us | 0.74 us | 0.61 us |  0.34 |
| AddLongReturnLong | 10 | 20 | 30 | 933.6 us | 2.12 us | 1.98 us |  1.00 |

My question is how this behavior can be explained. The IL code isn't even smaller when returning a long (for example, it doesn't have fewer bounds checks or the like).

EDIT 1: I updated the benchmark to run 1 million times instead of only 1000.

Theodor Zoulias
Link
  • @shingo, out of curiosity: Why? Isn't BenchmarkDotNet supposed to run every benchmarked method multiple times anyway? –  Nov 25 '22 at 12:16
  • I don't know if this is normal for the Apple M1 Pro CPU or for Arm64 RyuJIT results, but I find it quite puzzling that both AddIntReturnInt and AddLongReturnLong have an Error and StdDev that are an order of magnitude larger than those of AddIntReturnLong. Could your Mac perhaps be running other software or services in the background that loaded the CPU intermittently and therefore spoiled your benchmark runs? –  Nov 25 '22 at 12:22
  • @MySkullCaveIsADarkPlace Even the largest value, 925.9 ns / 1000, is less than 1 ns per iteration, which is close to a single CPU cycle, so any perturbation during the benchmark is enough to influence the result. The error of the first and last methods is far larger than that of the middle one, which also suggests the measurement is not stable. – shingo Nov 25 '22 at 12:25
  • @shingo, ah, I see. Thanks a lot for the response and explanation! –  Nov 25 '22 at 12:29
  • I updated the benchmark to run 1 million times instead of only 1000. The results are roughly the same. – Link Nov 25 '22 at 13:23
  • AddIntReturnLong() is faster because [the optimizer](https://stackoverflow.com/a/4045073/17034) can do a better job. Two optimizations apply: common subexpression elimination and code hoisting. It can compute a+b+c just once and do it before the loop is entered. So the loop body degenerates to a single long addition. This is possible because that addition cannot overflow. – Hans Passant Nov 25 '22 at 15:26
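
The transformation described in the last comment can be written out by hand. This is only a sketch of what "hoisting" means at source level, not a confirmed account of the machine code RyuJIT actually emits on Arm64. The class name `HoistingSketch` and the method `AddIntReturnLongHoisted` are made up for illustration and do not appear in the question.

```csharp
public static class HoistingSketch
{
    private const int Iterations = 1_000_000;

    // Same shape as AddIntReturnLong in the question:
    // a + b + c is written inside the loop body.
    public static long AddIntReturnLong(int a, int b, int c)
    {
        long result = 0;
        for (var i = 0; i < Iterations; i++)
            result += a + b + c;

        return result;
    }

    // Hypothetical hand-hoisted variant: the loop-invariant int sum is
    // computed once before the loop and widened to long, so the loop body
    // is a single long addition.
    public static long AddIntReturnLongHoisted(int a, int b, int c)
    {
        long sum = a + b + c; // int addition, then implicit widening to long
        long result = 0;
        for (var i = 0; i < Iterations; i++)
            result += sum;

        return result;
    }
}
```

Both methods return the same value for the same arguments, so an optimizer is free to turn the first into the second. Whether it does so for one benchmark method and not the others can only be confirmed by looking at the generated assembly.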

1 Answer

Adding integers is one of the simplest operations a processor can do, and your processor likely has multiple full-width adders, so there is no reason why adding 64-bit values should take any longer than adding 32-bit values. Simple code like this would typically be limited by memory and other factors rather than by actual compute performance.

My guess as to why AddIntReturnLong is the fastest is that the compiler or processor can apply some optimization that is not otherwise possible, due to the mixed long/int addition. The mix may allow assumptions about overflow behavior or better instruction-level parallelism. But I'm not sufficiently well versed in assembler or ARM to say anything for sure.
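
To make the "mixed long/int" point concrete, here is a minimal sketch of the C# language semantics involved: with `int` operands, `a + b + c` is evaluated in 32-bit arithmetic first and only the finished sum is widened to `long`. The class and method names are invented for illustration, and the sketch only shows what the language computes. It does not by itself explain the timing difference.

```csharp
public static class WideningSketch
{
    // What `result += a + b + c` does per iteration when result is long and
    // a, b, c are int: a 32-bit sum (wrapping in an unchecked context),
    // followed by a widening conversion to long.
    public static long SumAsIntThenWiden(int a, int b, int c)
    {
        return unchecked(a + b + c);
    }

    // For contrast: widening the first operand forces the whole sum
    // to be computed in 64-bit arithmetic.
    public static long SumAsLong(int a, int b, int c)
    {
        return (long)a + b + c;
    }
}
```

For the benchmark arguments (10, 20, 30) both methods return 60. They only differ when the 32-bit sum wraps: `SumAsIntThenWiden(int.MaxValue, 1, 0)` yields -2147483648, while `SumAsLong(int.MaxValue, 1, 0)` yields 2147483648.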

But tests like this would be super sensitive to compiler optimization, and I'm not sure the result is useful. If you need to optimize at this level you typically need to consider cache behavior and/or use SIMD to get improvements.

JonasH
  • Very fair points. This benchmark in particular doesn't make that much sense (especially in isolation). I was just curious how it can happen that the mixed approach is 3x faster. – Link Nov 25 '22 at 13:24