Performance discrepancy in compiled vs. hand-written assembly

Question

I've been playing around with using assembly language in Go and I've written a Hamming Weight function as an excercise.

I've based a native Go version on this SO answer and the assembly version is based on this doc from AMD (page 180). Upon benchmarking the two functions, I find that the native Go version is around 1.5x - 2x faster than the assembly version, despite the hand-written assembly version being almost identical to the output from go tool 6g -S popcount.go.

output from go test -bench=.

PASS
BenchmarkPopCount       100000000               19.4 ns/op 
BenchmarkPopCount_g     200000000                8.97 ns/op
ok      popcount   4.777s

popcount.go

package popcount

func popCount(i uint32) uint32 // Defined in popcount_amd64.s

func popCount_g(i uint32) uint32 {
    i = i - ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}

popcount_test.go

package popcount

import "testing"

func TestPopcount(t *testing.T) {
    for i := uint32(0); i < uint32(100); i++ {
        if popCount(i) != popCount_g(i) {
            t.Fatalf("failed on input = %v", i)
        }
    }
}

func BenchmarkPopCount(b *testing.B) {
    for i := 0; i < b.N; i++ {
        popCount(uint32(i))
    }
}

func BenchmarkPopCount_g(b *testing.B) {
    for i := 0; i < b.N; i++ {
        popCount_g(uint32(i))
    }
}

popcount_amd64.s

// func popCount(i uint32) uint32
TEXT ·popCount(SB),$0
    MOVL i+0(FP), BP        // i
    MOVL BP, BX             // i
    SHRL $1, BX             // i >> 1
    ANDL $0x055555555, BX   // (i >> 1) & 0x55555555
    SUBL BX, BP             // w = i - ((i >> 1) & 0x55555555)
    MOVL BP, AX             // w
    SHRL $2, BP             // w >> 2
    ANDL $0x033333333, AX   // w & 0x33333333
    ANDL $0x033333333, BP   // (w >> 2) & 0x33333333
    ADDL BP, AX             // x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
    MOVL AX, BX             // x
    SHRL $4, BX             // x >> 4
    ADDL AX, BX             // x + (x >> 4)
    ANDL $0x00F0F0F0F, BX   // y = (x + (x >> 4) & 0x0F0F0F0F)
    IMULL $0x001010101, BX  // y * 0x01010101
    SHRL $24, BX            // population count = (y * 0x01010101) >> 24
    MOVL BX, toReturn+8(FP) // Store result.
    RET                     // return

output from go tool 6g -S popcount.go

"".popCount_g t=1 size=64 value=0 args=0x10 locals=0
        000000 00000 (popcount.go:5)    TEXT    "".popCount_g+0(SB),4,$0-16
        000000 00000 (popcount.go:5)    NOP     ,
        000000 00000 (popcount.go:5)    NOP     ,
        000000 00000 (popcount.go:5)    MOVL    "".i+8(FP),BP
        0x0004 00004 (popcount.go:5)    FUNCDATA        $2,gclocals┬À9308e7ef08d2cc2f72ae1228688dacf9+0(SB)
        0x0004 00004 (popcount.go:5)    FUNCDATA        $3,gclocals┬À3280bececceccd33cb74587feedb1f9f+0(SB)
        0x0004 00004 (popcount.go:6)    MOVL    BP,BX
        0x0006 00006 (popcount.go:6)    SHRL    $1,BX
        0x0008 00008 (popcount.go:6)    ANDL    $1431655765,BX
        0x000e 00014 (popcount.go:6)    SUBL    BX,BP
        0x0010 00016 (popcount.go:7)    MOVL    BP,AX
        0x0012 00018 (popcount.go:7)    ANDL    $858993459,AX
        0x0017 00023 (popcount.go:7)    SHRL    $2,BP
        0x001a 00026 (popcount.go:7)    ANDL    $858993459,BP
        0x0020 00032 (popcount.go:7)    ADDL    BP,AX
        0x0022 00034 (popcount.go:8)    MOVL    AX,BX
        0x0024 00036 (popcount.go:8)    SHRL    $4,BX
        0x0027 00039 (popcount.go:8)    ADDL    AX,BX
        0x0029 00041 (popcount.go:8)    ANDL    $252645135,BX
        0x002f 00047 (popcount.go:8)    IMULL   $16843009,BX
        0x0035 00053 (popcount.go:8)    SHRL    $24,BX
        0x0038 00056 (popcount.go:8)    MOVL    BX,"".~r1+16(FP)
        0x003c 00060 (popcount.go:8)    RET     ,

I know from here that the FUNCDATA lines contain information for the garbage collector, but other than that I don't see any glaring differences.

What could be causing this large difference in speed between the 2 functions?

One thing to bear in mind is that the linker in go is a lot cleverer than a traditional C linker. In particular it can transform instructions, re-order code etc, so I suggest you look at the final binaries with objdump rather than comparing the go assembler output. — Nick Craig-Wood, Aug 24 '14 at 11:50
@NickCraig-Wood objdump shows that the assembly is still very similar, however the hand-written assembly version shows `callq 420610 ` before calling the function. The native go version has the entire assembly in `main.main` so I'm guessing the difference is the time taken to allocate more memory, as the extra jump shouldn't take that long. Any thoughts? — Intermernet, Aug 24 '14 at 12:57
@NickCraig-Wood as per peterSO's answer, it looks like the call-overhead *can* cost a few ns per op. Ah well, now I need to try to learn how to inline the assembly code if I can... *Down the rabbit hole I go!* — Intermernet, Aug 24 '14 at 13:57
AFAIK it isn't possible to inline assembly code :-( It has certainly been discussed on the dev list in the past. — Nick Craig-Wood, Aug 24 '14 at 14:04
@NickCraig-Wood Aha, good to know before I dig too deep. Thanks again. — Intermernet, Aug 24 '14 at 14:05

score 8 · Accepted Answer · answered Aug 24 '14 at 13:12

If you look at the pseudo-assembler for the funnction calls you will see that the .s (assembler) version is called, with call overhead, and the .go version is inlined.

func S() {
    pc := popCount(uint32(0))
    _ = pc
}

"".S t=1 size=48 value=0 args=0x0 locals=0x10
    0x0000 00000 (popcount.go:11)   TEXT    "".S+0(SB),$16-0
    0x0000 00000 (popcount.go:11)   MOVQ    (TLS),CX
    0x0009 00009 (popcount.go:11)   CMPQ    SP,(CX)
    0x000c 00012 (popcount.go:11)   JHI ,21
    0x000e 00014 (popcount.go:11)   CALL    ,runtime.morestack00_noctxt(SB)
    0x0013 00019 (popcount.go:11)   JMP ,0
    0x0015 00021 (popcount.go:11)   SUBQ    $16,SP
    0x0019 00025 (popcount.go:11)   FUNCDATA    $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0019 00025 (popcount.go:11)   FUNCDATA    $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0019 00025 (popcount.go:12)   MOVL    $0,(SP)
    0x0020 00032 (popcount.go:12)   PCDATA  $1,$0
    0x0020 00032 (popcount.go:12)   CALL    ,"".popCount(SB)
    0x0025 00037 (popcount.go:12)   MOVL    8(SP),BX
    0x0029 00041 (popcount.go:12)   NOP ,
    0x0029 00041 (popcount.go:14)   ADDQ    $16,SP
    0x002d 00045 (popcount.go:14)   RET ,

func S_G() {
    pc := popCount_g(uint32(0))
    _ = pc
}

"".S_G t=1 size=64 value=0 args=0x0 locals=0x8
    0x0000 00000 (popcount.go:16)   TEXT    "".S_G+0(SB),4,$8-0
    0x0000 00000 (popcount.go:16)   SUBQ    $8,SP
    0x0004 00004 (popcount.go:16)   FUNCDATA    $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0004 00004 (popcount.go:16)   FUNCDATA    $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0004 00004 (popcount.go:17)   MOVL    $0,BP
    0x0006 00006 (popcount.go:17)   MOVL    BP,BX
    0x0008 00008 (popcount.go:17)   SHRL    $1,BX
    0x000a 00010 (popcount.go:17)   ANDL    $1431655765,BX
    0x0010 00016 (popcount.go:17)   SUBL    BX,BP
    0x0012 00018 (popcount.go:17)   MOVL    BP,AX
    0x0014 00020 (popcount.go:17)   ANDL    $858993459,AX
    0x0019 00025 (popcount.go:17)   SHRL    $2,BP
    0x001c 00028 (popcount.go:17)   ANDL    $858993459,BP
    0x0022 00034 (popcount.go:17)   ADDL    BP,AX
    0x0024 00036 (popcount.go:17)   MOVL    AX,BX
    0x0026 00038 (popcount.go:17)   SHRL    $4,BX
    0x0029 00041 (popcount.go:17)   ADDL    AX,BX
    0x002b 00043 (popcount.go:17)   ANDL    $252645135,BX
    0x0031 00049 (popcount.go:17)   IMULL   $16843009,BX
    0x0037 00055 (popcount.go:17)   SHRL    $24,BX
    0x003a 00058 (popcount.go:19)   ADDQ    $8,SP
    0x003e 00062 (popcount.go:19)   RET ,

Thanks, you are correct (I discovered the same thing using @NickCraig-Wood 's suggestion of using objdump to examine the final binaries). — Intermernet, Aug 24 '14 at 13:21

score 4 · Answer 2 · edited Aug 25 '14 at 03:55

4

If you actually want a popcnt, it's been a native instruction on X86 processors for a while now. Your compiler may have an intrinsic for it, such as Microsoft's __popcnt() (also __popcnt16() and __popcnt64() ), or GCC's __builtin_popcount() (or __builtin_popcountll() for 64 bit).

edited Aug 25 '14 at 03:55

xmojmr

8,073
5
31
54

answered Aug 25 '14 at 02:30

rcgldr

27,407
3
36
61

2

Yes, I'm aware of the `popcnt` instruction on recent Intel and AMD processors. This was really just an exercise in learning the Plan9 based dialect of assembly that Go uses, and I came across this performance difference while testing. I have a feeling that certain instructions (such as `popcnt`) aren't supported by the Plan9 / Go assembly language, and there doesn't seem to be a `popcnt` or similar function available. – Intermernet Aug 25 '14 at 02:40
1

It's not supported last I checked, but you can encode the instruction yourself and include the literal bytes in hex format in the assembly, e.g. BYTE $0xf3; BYTE $0x48; BYTE $0x0f; BYTE $0xb8; BYTE $0x16 // POPCNTQ (SI), DX – Eloff Jul 11 '15 at 00:28

Performance discrepancy in compiled vs. hand-written assembly

2 Answers2