
I'd like to perform bitwise AND on every column of a byte matrix, which is stored as a [][]byte in Go. I created a repo with runnable test code.

It can be simplified to a bitwise AND operation on two byte slices of equal length. The simplest way is to use a for loop to handle every pair of bytes.

func and(x, y []byte) []byte {
    z := make([]byte, len(x))
    for i := 0; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}

However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark results and the sketch below).

BenchmarkLoop-16                   14467             84265 ns/op
BenchmarkUnrollLoop-16             17668             67550 ns/op
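
For reference, the unrolled version is roughly like this (a minimal sketch; the exact benchmarked code is in the linked repo and may differ in details):

// andUnroll processes 8 byte pairs per iteration, then handles the tail.
// Sketch only; the repo's benchmarked version may differ.
func andUnroll(x, y []byte) []byte {
    z := make([]byte, len(x))
    i := 0
    for ; i+8 <= len(x); i += 8 {
        z[i] = x[i] & y[i]
        z[i+1] = x[i+1] & y[i+1]
        z[i+2] = x[i+2] & y[i+2]
        z[i+3] = x[i+3] & y[i+3]
        z[i+4] = x[i+4] & y[i+4]
        z[i+5] = x[i+5] & y[i+5]
        z[i+6] = x[i+6] & y[i+6]
        z[i+7] = x[i+7] & y[i+7]
    }
    for ; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}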

Any faster way? Go assembly?

Thank you in advance.

shenwei356
  • What exactly are you benchmarking? Shouldn't `make(...)` call be outside of the benchmark loop? – Ruslan Jul 07 '21 at 07:12
  • SIMD is very good for bitwise-AND of two contiguous chunks of memory, into a 3rd. It should run at STREAM Triad speed on your machine, or faster if the data fits in cache instead of having to come from main memory. If Go doesn't auto-vectorize, then there's a lot of room to gain from using some way of getting SIMD instructions. – Peter Cordes Jul 07 '21 at 07:20
  • @Ruslan, which line? https://github.com/shenwei356/bench/blob/main/bitwise-and-on-byte-slices/test_test.go#L60 ? There's no `make` inside the benchmark loop. – shenwei356 Jul 07 '21 at 08:04
  • @PeterCordes Right, but there's no way to directly call SIMD instructions in Go; you have to write Go assembly :( which I'm not familiar with. – shenwei356 Jul 07 '21 at 08:08
  • This is a pretty simple problem to auto-vectorize; a compiler would just have to check whether the output slice overlapped either input. If your compiler doesn't auto-vectorize, your options are: use a different compiler (possible for a different language via a foreign-function interface), use assembly, or have your code run slow. Or maybe have your code at best work 8 bytes at a time, instead of 16 or 32, if Go lets you type-pun your pointer/slice types to use 64-bit chunks for the multiple-of-8 parts of the byte slices. I don't know Go. – Peter Cordes Jul 07 '21 at 12:55
  • @PeterCordes The Go compiler does not auto-vectorise and calling FFI functions incurs a costly overhead. Type punning is somewhat possible but frowned upon. The best choice is to write the relevant code in Go-style assembly. You can have the C compiler do that for you and then just copy-edit the calling convention. – fuz Jul 07 '21 at 14:36
  • Not enough information to answer this. You need to provide example data, even if it's dummy data – Zombo Jul 07 '21 at 14:46
  • I found a good implementation, which turns every 8 bytes into `uint64`s and computes AND on these numbers. It's 3X faster than the canonical `for` loop way (a sketch of the idea follows these comments). https://github.com/grailbio/base/blob/master/simd/and_amd64.go#L32 – shenwei356 Jul 09 '21 at 11:29
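
For illustration, the 8-bytes-as-uint64 idea looks roughly like this (a minimal sketch using encoding/binary; this is not the grailbio code itself, and the function name is mine):

// and64 ANDs 8-byte chunks as uint64 values, then handles the tail bytes.
// Requires importing "encoding/binary". Sketch only, not the grailbio code.
func and64(x, y []byte) []byte {
    z := make([]byte, len(x))
    i := 0
    for ; i+8 <= len(x); i += 8 {
        a := binary.LittleEndian.Uint64(x[i:])
        b := binary.LittleEndian.Uint64(y[i:])
        binary.LittleEndian.PutUint64(z[i:], a&b)
    }
    for ; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}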

1 Answer


I wrote a Go assembly implementation using AVX2 instructions after two days of learning (Go) assembly.

The performance is good, about 10X the simple loop version, though optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.

Note: code and benchmark results are updated.

Thanks to @PeterCordes for many valuable suggestions.

#include "textflag.h"

// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
    // pointer of x
    MOVQ x_base+0(FP), AX

    // length of x
    MOVQ x_len+8(FP), CX

    // pointer of y
    MOVQ y_base+24(FP), DX

    // --------------------------------------------
    // end address of x, will not change: p + n
    MOVQ AX, BX
    ADDQ CX, BX

    // n <= 8, jump to tail
    CMPQ CX, $0x00000008
    JLE  tail

    // n < 16, jump to loop8
    CMPQ CX, $0x00000010
    JL   loop8_start

    // n < 32, jump to loop16
    CMPQ CX, $0x00000020
    JL   loop16_start

    // --------------------------------------------
    // end address for loop32
    MOVQ BX, CX
    SUBQ $0x0000001f, CX

loop32:
    // compute x & y, and save value to x
    VMOVDQU (AX), Y0
    VANDPS  (DX), Y0, Y0
    VMOVDQU Y0, (AX)

    // move pointer
    ADDQ $0x00000020, AX
    ADDQ $0x00000020, DX
    CMPQ AX, CX
    JL   loop32

    // n <= 8, jump to tail
    MOVQ BX, CX
    SUBQ AX, CX
    CMPQ CX, $0x00000008
    JLE  tail

    // n < 16, jump to loop8
    CMPQ CX, $0x00000010
    JL   loop8_start

    // --------------------------------------------
loop16_start:
    // end address for loop16
    MOVQ BX, CX
    SUBQ $0x0000000f, CX

loop16:
    // compute x & y, and save value to x
    VMOVDQU (AX), X0
    VANDPS  (DX), X0, X0
    VMOVDQU X0, (AX)

    // move pointer
    ADDQ $0x00000010, AX
    ADDQ $0x00000010, DX
    CMPQ AX, CX
    JL   loop16

    // n <= 8, jump to tail
    MOVQ BX, CX
    SUBQ AX, CX
    CMPQ CX, $0x00000008
    JLE  tail

    // --------------------------------------------
loop8_start:
    // end address for loop8
    MOVQ BX, CX
    SUBQ $0x00000007, CX

loop8:
    // compute x & y, and save value to x
    MOVQ (AX), BX
    ANDQ (DX), BX
    MOVQ BX, (AX)

    // move pointer
    ADDQ $0x00000008, AX
    ADDQ $0x00000008, DX
    CMPQ AX, CX
    JL   loop8

    // --------------------------------------------
tail:
    // left elements (<=8)
    MOVQ (AX), BX
    ANDQ (DX), BX
    MOVQ BX, (AX)
    RET
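
To call the routine from Go, the .s file needs a matching body-less declaration in the same package; a minimal sketch (package and file names here are illustrative, the real ones are in the repo):

// and_amd64.go (illustrative file name)
package pand // illustrative package name

// AND computes x & y byte-wise and writes the result back into x.
// Both slices must have the same length. Implemented in and_amd64.s.
//go:noescape
func AND(x []byte, y []byte)

A caller then just does AND(x, y), with the result overwriting x.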

Benchmark result:

test                       data-size        time
-------------------        ---------        -----------
BenchmarkGrailbio          8.00_B           4.654 ns/op
BenchmarkGoAsm             8.00_B           4.824 ns/op
BenchmarkUnrollLoop        8.00_B           6.851 ns/op
BenchmarkLoop              8.00_B           8.683 ns/op

BenchmarkGrailbio          16.00_B          5.363 ns/op
BenchmarkGoAsm             16.00_B          6.369 ns/op
BenchmarkUnrollLoop        16.00_B          10.47 ns/op
BenchmarkLoop              16.00_B          13.48 ns/op

BenchmarkGoAsm             32.00_B          6.079 ns/op
BenchmarkGrailbio          32.00_B          6.497 ns/op
BenchmarkUnrollLoop        32.00_B          17.46 ns/op
BenchmarkLoop              32.00_B          21.09 ns/op

BenchmarkGoAsm             128.00_B         10.52 ns/op
BenchmarkGrailbio          128.00_B         14.40 ns/op
BenchmarkUnrollLoop        128.00_B         56.97 ns/op
BenchmarkLoop              128.00_B         80.12 ns/op

BenchmarkGoAsm             256.00_B         15.48 ns/op
BenchmarkGrailbio          256.00_B         23.76 ns/op
BenchmarkUnrollLoop        256.00_B         110.8 ns/op
BenchmarkLoop              256.00_B         147.5 ns/op

BenchmarkGoAsm             1.00_KB          47.16 ns/op
BenchmarkGrailbio          1.00_KB          87.75 ns/op
BenchmarkUnrollLoop        1.00_KB          443.1 ns/op
BenchmarkLoop              1.00_KB          540.5 ns/op

BenchmarkGoAsm             16.00_KB         751.6 ns/op
BenchmarkGrailbio          16.00_KB         1342 ns/op
BenchmarkUnrollLoop        16.00_KB         7007 ns/op
BenchmarkLoop              16.00_KB         8623 ns/op
shenwei356
  • You can save instructions / uops by using a loop structure like `do{}while(ptr < endp)`, with a `cmp / jb` at the bottom involving AX or DX. [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926). You'd still need a check in case the vector loop should run 0 times, but that check can be outside the loop. (And you'd want an LEA to compute `end_pointer = p + size - 0x1f` in a register instead of using CX as a down-counter). If you aren't memory bottlenecked, this could even make your loop run faster, else just more hyperthreading friendly. – Peter Cordes Jul 13 '21 at 20:34
  • You don't need to xor-zero Y0 before the first load. `VMOVAPD (AX), Y0` overwrites Y0 without caring about the previous value. Also, I'd recommend `vmovdqa` SIMD-integer load/store instructions, not `vmovapd` (packed-double). It won't actually matter on most CPUs, I don't think. If you want, you could avoid AVX2 by only using `vmovaps` (or pd) and `vandps`. – Peter Cordes Jul 13 '21 at 20:37
  • Hi @PeterCordes, I changed the code following all your suggestions, thank you so much! Should I replace the code in the answer or append a new version? Anyway, it's available on GitHub. – shenwei356 Jul 14 '21 at 05:26
  • I'd recommend just updating the code. You could mention that the old version is still there in the edit history, especially if you want to include a separate benchmark line for it. – Peter Cordes Jul 14 '21 at 05:40
  • @PeterCordes, `vmovdqu` instead of `vmovdqa` should be used, because the data may not be aligned, at least for Go slices :( It took me nearly a whole day to find this bug... :( – shenwei356 Jul 15 '21 at 02:19
  • Well yeah, if your data isn't guaranteed to be aligned, then you wouldn't want to use misalignment-detecting loads; letting the hardware handle possibly misaligned loads/stores is generally fine. Fortunately for you, AVX memory source operands for other instructions (like `vpand`) don't require alignment. – Peter Cordes Jul 15 '21 at 02:25
  • I found `VPAND` brings slightly higher performance than `VANDPS`. And the package supports AVX512 now. Thank you @PeterCordes and @fuz! – shenwei356 Jul 15 '21 at 11:57