
While playing with some simple code in Go, I noticed that using a bool array instead of an int array (which only ever holds the values 0/1) gives a pretty significant speed-up:

  • funcUsingBool - 1.397s
  • funcUsingInt - 1.996s

I would have expected both of them to give the same performance, since there isn't a native bool type at machine level, so I expected the compiler to generate similar assembly code for both.

Since the difference is pretty big, I'm skeptical of the validity of this result.

I'm building with `go build filename.go`, but I'm not sure what the equivalent of gcc's `-O3` flag is.

// funcUsingBool counts 2 and the odd primes below n, marking composites in a []bool.
func funcUsingBool(n int) int {
    if n < 1 { return 0 }

    notPrime := make([]bool, n+1)
    count := 1
    for i := 3; i < n; i = i + 2 {
        if notPrime[i] { continue }
        count++
        k := 2 * i
        for k <= n {
            notPrime[k] = true
            k += i
        }
    }
    return count
}

// funcUsingInt does the same, but marks composites with 0/1 values in a []int.
func funcUsingInt(n int) int {
    if n < 1 { return 0 }

    notPrime := make([]int, n+1)
    count := 1
    for i := 3; i < n; i = i + 2 {
        if notPrime[i] == 1 { continue }
        count++
        k := 2 * i
        for k <= n {
            notPrime[k] = 1
            k += i
        }
    }
    return count
}
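
For reference, here is a minimal sketch of how the two functions could be timed with Go's standard `testing` package (the benchmark names and the choice of n are my assumptions; the numbers above may have been measured differently):

// prime_bench_test.go: assumes funcUsingBool and funcUsingInt are in the same package.
package main

import "testing"

const benchN = 10000000 // assumed input size; the original n isn't stated

func BenchmarkUsingBool(b *testing.B) {
    for i := 0; i < b.N; i++ {
        funcUsingBool(benchN)
    }
}

func BenchmarkUsingInt(b *testing.B) {
    for i := 0; i < b.N; i++ {
        funcUsingInt(benchN)
    }
}

Run it with `go test -bench=.`. Note that `go build` already applies the compiler's optimizations by default; Go has no `-O3`-style levels (though `-gcflags '-N -l'` disables optimizations and inlining for debugging).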
twosan
  • Why wouldn't you expect some difference when using different sized values? – JimB Jun 30 '17 at 14:17
  • There's no native bool type, true, but that still makes a bool 1 byte and an int 4 or 8 bytes. – Adrian Jun 30 '17 at 14:21
  • @JimB I had timed the "allocation" of the slices at 3 ms and 28 ms, so I assumed the variable size not to be significant in the overall processing time. Now that I think of it, I agree that cache misses could be more frequent in the int case, but I'm not sure that explains it all. I saw a [C++ thread](https://stackoverflow.com/questions/5764956/which-is-faster-if-bool-or-ifint) and there, at least, the generated assembly was more or less the same. – twosan Jun 30 '17 at 14:56
  • @Adrian True, but for the 1-byte value you would have to pack & unpack it, since the processor works on 4 or 8 bytes. – twosan Jun 30 '17 at 14:56
  • Surprised by the score of the question (negative). Serious question: am I bringing down the level of discussion fellow "GO-ers" expect with a trivial topic, and should I ask such questions somewhere else, or is my phrasing poor and lacking? – twosan Jun 30 '17 at 15:06
  • Your question has no question, just a statement that you've proven empirically yet don't believe. Plus, microbenchmarks have a limited value and even more limited lifespan, especially with Go in a phase of life where new versions have few feature changes and many performance/compiler changes. – Adrian Jun 30 '17 at 15:27
  • @Adrian I accept and agree with what you said. Thanks for taking the time to answer. – twosan Jun 30 '17 at 17:29
  • @twosan: x86 (and most other CPUs) can efficiently work with byte elements. A zero-extending or sign-extending load of a byte into a 64-bit register is just as cheap as a 4-byte or 8-byte load on current Intel CPUs. Byte stores are cheap, too. It's only early Alpha AXP CPUs that could only load aligned words and had to shift/mask to get at bytes. – Peter Cordes Jul 02 '17 at 00:28
  • What would be more expensive is using a bitmap, since setting a bit to `true` requires a read-modify-write of the byte containing it, which is more expensive than a simple store. However, this is worth it when your array gets big enough, because fitting in L1 or L2 cache to avoid cache misses is worth the extra CPU overhead for each write operation. (Testing a single bit from memory is still cheap, especially on x86.) – Peter Cordes Jul 02 '17 at 00:31
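
(To make the bitmap trade-off from the comments above concrete, here is a small illustrative sketch in Go, my own example rather than anything from the thread: setting a bit is a read-modify-write of the containing byte, while a plain []bool write is a single byte store.)

package main

import "fmt"

// setBit marks index i: load the containing byte, OR in the bit, store it back
// (a read-modify-write, unlike the single byte store of notPrime[i] = true).
func setBit(bits []byte, i int) {
    bits[i>>3] |= 1 << uint(i&7)
}

// testBit checks index i: one load plus a shift and mask, still cheap.
func testBit(bits []byte, i int) bool {
    return bits[i>>3]&(1<<uint(i&7)) != 0
}

func main() {
    n := 100
    bits := make([]byte, (n+7)/8) // one bit per flag: 8x denser than []bool
    setBit(bits, 42)
    fmt.Println(testBit(bits, 42), testBit(bits, 43)) // true false
}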

1 Answer


Looking at the assembly output (`go run -gcflags '-S' test.go`), there is some difference:

Bool:

0x0075 00117 (test.go:11)   MOVBLZX (AX)(BX*1), DI
0x0079 00121 (test.go:11)   TESTB   DIB, DIB

Int:

0x0075 00117 (test.go:28)   MOVQ    (AX)(BX*8), DI
0x0079 00121 (test.go:28)   CMPQ    DI, $1

Byte/uint8:

0x0075 00117 (test.go:28)   MOVBLZX (AX)(BX*1), DI
0x0079 00121 (test.go:28)   CMPB    DIB, $1

The rest of the assembly is near-identical for me on Go 1.8.*.

So: 1) the data types are sized differently, and 2) the operations are different.
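
(For completeness, the byte/uint8 numbers above presumably come from a variant like the following sketch, my reconstruction rather than the answer's original code; only the element type differs from funcUsingInt:)

func funcUsingByte(n int) int {
    if n < 1 { return 0 }

    notPrime := make([]uint8, n+1)
    count := 1
    for i := 3; i < n; i = i + 2 {
        if notPrime[i] == 1 { continue }
        count++
        k := 2 * i
        for k <= n {
            notPrime[k] = 1
            k += i
        }
    }
    return count
}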

Martin Gallagher
  • No, a bool is not represented as a word, it's represented as a byte, hence "move byte" vs "move quadword". https://play.golang.org/p/KVCTusPDOT – JimB Jun 30 '17 at 14:44
  • Yep, I edited that immediately ;-) - [which supports your comment above; int/bool are not the same size] – Martin Gallagher Jun 30 '17 at 14:49
  • Thanks for the analysis! – twosan Jun 30 '17 at 15:03
  • It's just a cache footprint effect. The different asm sequences all have identical performance, other than cache-misses or code-size effects (which can include alignment of the rest of the code, which can affect which branches alias each other in the branch-predictor...). But anyway, `movzx` zero-extending loads have the same performance as `movq` 8-byte loads on all current CPUs, except AMD CPUs where `movzx` needs an ALU operation as well as a load-port (http://agner.org/optimize/). `test` vs. `cmp` is not significant (except for code-size); both can macro-fuse with a following `jcc`. – Peter Cordes Jul 02 '17 at 00:24
  • @PeterCordes: "It's just a cache footprint effect." is a bit like a doctor saying: "It's just brain cancer." Oh, and it's not really true anyway--along with reduced cache footprint you also get reduced memory bandwidth requirements. – Jerry Coffin Jul 02 '17 at 07:19
  • @JerryCoffin: yes, that was clumsy wording. I was just trying to say that the asm differences highlighted by this answer are *not* part of the perf difference (except for secondary effects from branch alignment), and if anything they'd hurt the `bool` / `byte` version on AMD CPUs. They do tell us that the `int` version is wasting a factor of 8 more storage for the array, though, since it's using 8-byte elements! (For large problem sizes, a bitmap instead of bool array would be worth the extra CPU overhead of needing a read-modify-write to set a bit. Testing a bit is still cheap.) – Peter Cordes Jul 02 '17 at 07:48
  • @PeterCordes: Fair enough. Testing (on at least some CPUs) seems to indicate that a bit vector is *almost* always at least as fast as a vector of bytes, and you don't need a very big problem at all for it to show an advantage. See https://codereview.stackexchange.com/q/117880/489 for one test. – Jerry Coffin Jul 02 '17 at 15:11
  • @JerryCoffin: testing a bit is very cheap (still just a load, but with a couple extra ALU ops), but setting a bit is more expensive. I'm not sure that benchmark is as write-heavy as prime sieving (since it reads and even branches on bits with `if (!arr[i])` instead of unconditionally writing), but yeah it looks like a reasonable test if the compiler doesn't manage to optimize too much of it away. (I didn't look that closely, but I think it is trying to avoid writing sequential bits which would have allowed the compiler to merge it into a byte store). – Peter Cordes Jul 02 '17 at 15:30
  • @PeterCordes: As far as code specifically for the Sieve of Eratosthenes goes, I've done a little bit of comparison on that front myself. https://stackoverflow.com/a/16738887/179910 – Jerry Coffin Jul 02 '17 at 17:01