0

I am benchmarking a software library I created in Go, and I encountered dissonance between runtime and ns/op. I am new to benchmarking, and Go's documentation and past stackoverflow questions do not conceptually cover benchmarking in depth, so I am seeking someone with more conceptual knowledge than me to help me (and other stackoverflow users in similar predicaments) understand what exactly is happening.

Benchmarking output for a task performed using native Go:

1000000000               0.6136 ns/op          0 B/op          0 allocs/op
PASS
ok      github.com/gabetucker2/gostack/benchmark        0.862s

Benchmarking output for the same task performed using my software library:

1576087               805.3 ns/op           544 B/op         21 allocs/op
PASS
ok      github.com/gabetucker2/gostack/benchmark        2.225s

Notice two things:

  • The ns/op of my software library is around 1200 times slower than the ns/op of native Go
  • The runtime of my software library is around 2 times slower than the runtime of native Go

It seems impossible to me that a very simple function from my software library should be 1200 times slower than native Go code, and it seems much more plausible that it is only 2 times slower... so what exactly is going on here?

Just in case it is useful, here are the Benchmark functions being called:

func test_Native_CreateArray() {

    myArr := []int {1, 2, 3}
    
    gogenerics.RemoveUnusedError(myArr)

}

func test_Gostack_CreateArray() {

    myStack := MakeStack([]int {1, 2, 3})
        
    gogenerics.RemoveUnusedError(myStack)

}

// native Go
func Benchmark_Native_CreateArray(b *testing.B) {
    for i := 0; i < b.N; i++ {
        test_Native_CreateArray()
    }
}
// my software library "gostack"
func Benchmark_Gostack_CreateArray(b *testing.B) {
    for i := 0; i < b.N; i++ {
        test_Gostack_CreateArray()
    }
}

Any clarity would be greatly appreciated.

gabetucker22
  • 118
  • 7
  • Please do not post images of text. – Volker Oct 27 '22 at 07:23
  • 1
    And: 0.5ns/op means nothing happened wich means the compiler optimized your benchmark loop away. – Volker Oct 27 '22 at 07:25
  • 1
    For the runtime, (as far as I know) it counts the total runtime. This means that the first example executed 1_000_000_000 times with an avarage of 0.6ns/op this would be 0.6 seconds of runtime of the total 0.85 seconds. The second benchmark only runs 1_694_395 times and needs 711ns/op because of that it does only have a runtime of 2 seconds. Would you force the benchmark to also run 1_000_000_000 times the runtime should match the 1200x slower result. – Nordiii Oct 27 '22 at 07:27
  • @Volker My mistake—I have replaced my image with text. Can you expand on why it optimized my benchmark loop away? – gabetucker22 Oct 27 '22 at 07:29
  • @Nordiii Thank you for letting me know—that explains it. If you post your comment in the form of an answer, I will mark it as correct. One follow-up question based on your answer: is it possible to set the number of iterations from within the Benchmark function, or must you do this in the terminal window using `-benchtime=x`? – gabetucker22 Oct 27 '22 at 07:36
  • Because the program behaves exactly the same no matter if your test_Native_CreateArray is executed or not. (And please: Use proper names) – Volker Oct 27 '22 at 07:38
  • @Volker test_Native_CreateArray creates an array, which consumes memory and, by extension, running time... what am I missing? And sorry, but I am unsure of what you mean by proper names. – gabetucker22 Oct 27 '22 at 07:41
  • @gabetucker22 no sorry, I dont know if there are other ways than the command line – Nordiii Oct 27 '22 at 07:57
  • 2
    @gabetucker22, look at the output. The first benchmark clearly does not allocate any memory. RemoveUnusedError is presumably an empty function and it should be obvious that removing this call doesn't change the program's behavior. This leaves only the definition of the slice literal, which the compiler knows has no side-effects (in contrast to the MakeStack call) and can therefore be removed as well. Your benchmark loop is empty. Benchmarking things that have no observable effect is totally pointless. – Peter Oct 27 '22 at 08:53
  • 1
    With some more experience you will learn that you can't get anything useful done in half a nanosecond (on commodity hardware at least). Benchmark results like this always show that the benchmark is flawed. – Peter Oct 27 '22 at 08:55
  • 1
    @Peter: Half a nanosecond is about 2 clock cycles. Max throughput for typical modern CPUs is 2x 8 single-precision FMA per clock cycle with AVX, so you can get some useful work done in well optimized code. But not allocating memory from the heap. Probably the repeat loop fully optimized away, but it's also possible that there's a 64-bit store and a 32-bit store to initialize a 3-element `int` array on the stack. (Missed optimization vs. padding to 128-bit for one SIMD store). That would account for it taking 2 clock cycles per iteration instead of 1 for an empty loop. – Peter Cordes Oct 27 '22 at 11:47
  • Nearly the same question was asked less than 12 hours before this: [Go benchmark. Why function with opearions could have same perfomance as empty function?](https://stackoverflow.com/q/74213896) . See my comment there and [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) - microbenchmarking is hard, you need to carefully construct something that repeats the operation you want to measure, in a way that the compiler doesn't optimize away the repeat but does optimize away other overhead. – Peter Cordes Oct 27 '22 at 13:07
  • 1
    @PeterCordes Your explanations here and in [your other answer](https://stackoverflow.com/q/74213896) have been very useful—thank you. – gabetucker22 Oct 27 '22 at 16:11

1 Answers1

2

The first function ran 1_000_000_000 times with 0.61ns/op which is 0.61 seconds of the total runtime which took 0.862 seconds.

The second function ran 1_576_087 time with 805ns/op this takes around 1.26875 seconds of the 2.225 seconds. Forcing the second function to run 1_000_000_000 times should end up with around 805 seconds + overhead.

Nordiii
  • 446
  • 1
  • 3
  • 10