
This post is related to my earlier question Golang assembly implement of _mm_add_epi32, where the function adds paired elements of two [8]int32 arrays and returns the updated first one.

According to a pprof profile, passing [8]int32 by value is expensive, so I switched to passing a pointer to the array, which is much cheaper; a benchmark confirmed this. Here's the Go version:

func __mm_add_epi32_inplace_purego(x, y *[8]int32) {
    (*x)[0] += (*y)[0]
    (*x)[1] += (*y)[1]
    (*x)[2] += (*y)[2]
    (*x)[3] += (*y)[3]
    (*x)[4] += (*y)[4]
    (*x)[5] += (*y)[5]
    (*x)[6] += (*y)[6]
    (*x)[7] += (*y)[7]
}
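
For comparison, here is a minimal benchmark sketch of the kind used to verify this; the value-passing variant __mm_add_epi32_purego and the inputs are illustrative, not my exact harness:

import "testing"

// Hypothetical value-passing variant, shown only for the comparison.
func __mm_add_epi32_purego(x, y [8]int32) [8]int32 {
    for i := range x {
        x[i] += y[i]
    }
    return x
}

func BenchmarkAddByValue(b *testing.B) {
    x, y := [8]int32{}, [8]int32{1, 1, 1, 1, 1, 1, 1, 1}
    for i := 0; i < b.N; i++ {
        x = __mm_add_epi32_purego(x, y)
    }
}

func BenchmarkAddByPointer(b *testing.B) {
    x, y := [8]int32{}, [8]int32{1, 1, 1, 1, 1, 1, 1, 1}
    for i := 0; i < b.N; i++ {
        __mm_add_epi32_inplace_purego(&x, &y)
    }
}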

This function is called inside two nested loops.

The algorithm computes a positional population count over an array of bytes.

Thanks to advice from @fuz, I understand that writing the whole algorithm in assembly would be the best choice, but that's beyond my ability since I have never programmed in assembly.

However, it should be easy to optimize the inner loop with assembly:

counts := make([][8]int32, numRowBytes)

for i, b = range byteSlice {
    if b == 0 {                  // more than half of the elements in byteSlice are 0.
        continue
    }
    expand = _expand_byte[b]
    __mm_add_epi32_inplace_purego(&counts[i], expand)
}

// expands a byte into its bits
var _expand_byte = [256]*[8]int32{
    &[8]int32{0, 0, 0, 0, 0, 0, 0, 0},
    &[8]int32{0, 0, 0, 0, 0, 0, 0, 1},
    &[8]int32{0, 0, 0, 0, 0, 0, 1, 0},
    &[8]int32{0, 0, 0, 0, 0, 0, 1, 1},
    &[8]int32{0, 0, 0, 0, 0, 1, 0, 0},
    ...
}

Could you help write an assembly version of __mm_add_epi32_inplace_purego (that would be enough for me), or even of the whole loop? Thank you in advance.

shenwei356
  • Also, what instruction set extensions are you allowed to use? Is it just SSE/SSE2? Your previous post hinted that you may be able to use everything up to AVX2. Is that correct? There are some ways to make the code a lot more efficient if certain instructions can be used. – fuz Aug 04 '20 at 14:10
  • It's shown below: just 256 [8]int32 values for unpacking all possible bytes, showing which bits are set in each byte. – shenwei356 Aug 04 '20 at 14:13
  • So `expand_byte` is just [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675), turning 8 bits into 8 `int32`? If so, you're implementing positional popcount on 8-bit bytes? You could use [Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2](https://stackoverflow.com/q/55081525) and then just combine mod-8 buckets that should map to the same bucket, or do that during the widening. See also https://github.com/mklarqvist/positional-popcount for some optimized SIMD implementations. – Peter Cordes Aug 04 '20 at 14:14
  • Both SSE2 and AVX2 are OK. I hadn't thought about this; ideally it should be available on most modern CPUs. – shenwei356 Aug 04 '20 at 14:16
  • @WeiShen It makes a rather big difference as many useful instructions were added with more recent instruction set extensions. Let me see... – fuz Aug 04 '20 at 14:17
  • Can you call a C or C++ function? If you can call a pure asm function, you can call C++. There are good vectorized positional-popcount implementations in those languages. – Peter Cordes Aug 04 '20 at 14:19
  • @PeterCordes Should be possible to translate that code to Go-style assembly. It is possible to call C functions, but the overhead is non-trivial as it involves a stack switch and reconfiguration of the runtime. – fuz Aug 04 '20 at 14:30
  • @PeterCordes Wow. These routines are surprisingly complex. Translating them to Go assembly is going to be at least a week-long task. – fuz Aug 04 '20 at 14:46
  • @fuz: I'd start by compiling one and translating the compiler-generated asm. But yeah, positional popcount is non-trivial if you want to do it efficiently. A naive 256x `__m256i` lookup table would be one option, or just simple inverse movemask feeding `vpaddd` should be simple enough to code easily but still much faster than scalar. – Peter Cordes Aug 04 '20 at 14:50
  • @PeterCordes I was thinking about gathering corresponding bits from the input bytes and then using a series of scalar popcounts. This way, you can process 32 bytes with 8 popcounts and operations spread nicely over vector and scalar execution units. Not gonna come even close to the CSA approach from [the paper](https://arxiv.org/pdf/1911.02696.pdf) though. – fuz Aug 04 '20 at 14:53
  • @fuz: oh, interesting, like with 8x `vpslld` -> `vpmovmskb` -> `popcnt` -> `add r,r`. Yeah that's good for 8-bit chunks. Note that on Intel Skylake, vector-integer shifts run on ports 0 and 1, `vpmovmskb` runs on port 0, and scalar `popcnt` runs only on port 1 because it's scalar with 3-cycle latency. Since Intel has vector and scalar ALUs sharing ports, it's not that nicely distributed. Oh, but you can left shift using `vpaddd same,same,same` (p015 on Skylake) instead of an actual shift instruction. (Maybe use one left-shift by 4 of the original vector to create some ILP.) – Peter Cordes Aug 04 '20 at 15:00
  • @PeterCordes Hm... interesting... that could just work. I'll try and sketch an implementation. – fuz Aug 04 '20 at 15:06
  • Wow, thank you @PeterCordes and fuz; Golang assembly is preferred over calling C/C++ functions. I'm trying to understand the technical details you mentioned. – shenwei356 Aug 04 '20 at 15:17
  • @WeiShen These are x86 instructions. Refer to the Intel Software Developer's Manual for details. I'll write a detailed answer shortly. – fuz Aug 04 '20 at 15:21
  • @PeterCordes Like this? https://gist.github.com/fuzxxl/3de220b146f51e5f7bc9f4500c62847f – fuz Aug 04 '20 at 15:29
  • @fuz: Yes, but I was thinking of register-dst add, using 8 integer regs. Memory-destination add is more uops (at least 2), and creates a loop-carried dependency chain. (Although with 8 different counters that's in theory enough parallelism to hide latency, but requires 6 store-forwardings to be in flight at once, if we're lucky..). And all those cache load/store operations leave fewer load buffers available for OoO exec of the SIMD loads.) For more than a few loop iterations, saving/restoring enough integer regs is probably worth it, unless cache misses on the src data are the bottleneck – Peter Cordes Aug 04 '20 at 15:37
  • @PeterCordes Hm... good idea. Let me try that next. – fuz Aug 04 '20 at 15:39
  • @PeterCordes Does make a significant difference! I get 11.3 GB/s with the counters kept in registers. The gist has been updated. – fuz Aug 04 '20 at 16:00
  • @WeiShen Okay. The algorithm works, now to translate it to Go-style assembly. – fuz Aug 04 '20 at 16:02
  • @PeterCordes Quite strangely, on my Haswell box, the variant with counters in registers performs worse on very large (1GB) data set sizes. Why could that be? – fuz Aug 04 '20 at 16:04
  • @fuz: IDK, sounds strange. Reading `/dev/urandom` is outside the timed interval (or hopefully you used something else for 1GB of data), and CPU frequency + page-fault warmup is done before either timed region. Maybe try `.p2align 5` or `6` before the critical loops in case there's some uop-cache difference? IIRC, the `jcc`-touching-32B-boundary problem is only on Skylake-derived CPUs. – Peter Cordes Aug 04 '20 at 16:13
  • @fuz: I tried it on my i7-6700k Skylake (3.9GHz, DDR4-2666), gcc10 -O2 on Arch Linux. I changed your harness to /dev/zero with `len=1024ULL * 1024*1024;` and got addreg 8.61424e+09 B/sec, addmem 7.20125e+09 B/s. (I left the warm-up runs, but commented 2 of the 3 benchmark calls, so I could perf stat it. Both ran almost exactly 3 IPC, but the addreg version ran longer, presumably because it finished in under 1 second in benchmark().) Did you try `perf stat` to check for front-end vs. back-end bottlenecks? – Peter Cordes Aug 04 '20 at 16:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219210/discussion-between-fuz-and-peter-cordes). – fuz Aug 04 '20 at 17:26
  • @fuz: Please don't untag both SSE and AVX from x86 SIMD questions, especially not to include both x86 and x86-64. SSE or AVX are good catch-all tags for x86-simd questions, and I've tried to make sure all questions about SSE or AVX instructions or intrinsics are tagged with at least one of those. (Or avx512 if it's an avx512-only thing, but also avx if the same idea works for narrower vectors.) Some searchers might use `[x86*]` instead of `[sse]` or `[avx]`, so I went for the middle ground. – Peter Cordes Jan 14 '21 at 16:58
  • @PeterCordes Okay cool! – fuz Jan 14 '21 at 17:11

1 Answer


The operation you want to perform is called a positional population count on bytes. This is a well-known operation used in machine learning and some research has been done on fast algorithms to solve this problem.
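
To pin down the semantics first: for each bit position i (with bit 0 the least significant), the result counts how many input bytes have bit i set. A naive pure-Go reference could look like the sketch below; the function name is illustrative, and it matches the behaviour, not the speed, of the assembly further down.

// pospopcnt8Naive is an illustrative scalar reference: for each bit position
// i (0 = least significant), it adds to counts[i] the number of bytes in buf
// that have bit i set.
func pospopcnt8Naive(counts *[8]int32, buf []byte) {
    for _, b := range buf {
        for i := uint(0); i < 8; i++ {
            counts[i] += int32(b>>i) & 1
        }
    }
}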

Unfortunately, the implementation of these algorithms is fairly involved. For this reason, I have developed a custom algorithm that is much simpler to implement but only yields roughly half the performance of the other method. However, at a measured 10 GB/s, it should still be a decent improvement over what you had previously.

The idea of this algorithm is to gather corresponding bits from groups of 32 bytes using vpmovmskb and then to take a scalar population count which is then added to the corresponding counter. This allows the dependency chains to be short and a consistent IPC of 3 to be reached.
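
A rough scalar sketch of this data flow, on groups of 8 bytes instead of 32 and using math/bits (this is only an illustration of the idea, not the actual code):

import "math/bits"

// count8 sketches the vector loop's idea on 8 bytes at a time: collect the
// most significant bit of every byte into a mask (the scalar analogue of
// VPMOVMSKB), popcount the mask, add it to the counter for that bit
// position, then shift every byte left by one so the next bit becomes the
// MSB (the analogue of VPADDD Y0, Y0, Y0 in the assembly below).
func count8(counts *[8]int32, group [8]byte) {
    for bit := 7; bit >= 0; bit-- {
        var mask uint
        for i, b := range group {
            mask |= uint(b>>7) << uint(i) // gather MSBs into a bitmask
        }
        counts[bit] += int32(bits.OnesCount(mask))
        for i := range group {
            group[i] <<= 1 // expose the next lower bit
        }
    }
}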

Note that compared to your algorithm, my code flips the order of bits around. You can change this by editing which counts array elements the assembly code accesses if you want. However, in the interest of future readers, I'd like to leave this code with the more common convention where the least significant bit is considered bit 0.

Source code

The complete source code can be found on github. The author has meanwhile developed this algorithm idea into a portable library that can be used like this:

import "github.com/clausecker/pospop"

var counts [8]int
pospop.Count8(&counts, buf)  // add positional popcounts for buf to counts

The algorithm is provided in two variants and has been tested on a machine with a processor identified as “Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz.”

Positional Population Count 32 Bytes at a Time

The counters are kept in general purpose registers for best performance. Memory is prefetched well in advance for better streaming behaviour. The scalar tail is processed using a very simple SHRL/ADCL combination. A performance of up to 11 GB/s is achieved.

#include "textflag.h"

// func PospopcntReg(counts *[8]int32, buf []byte)
TEXT ·PospopcntReg(SB),NOSPLIT,$0-32
    MOVQ counts+0(FP), DI
    MOVQ buf_base+8(FP), SI     // SI = &buf[0]
    MOVQ buf_len+16(FP), CX     // CX = len(buf)

    // load counts into register R8--R15
    MOVL 4*0(DI), R8
    MOVL 4*1(DI), R9
    MOVL 4*2(DI), R10
    MOVL 4*3(DI), R11
    MOVL 4*4(DI), R12
    MOVL 4*5(DI), R13
    MOVL 4*6(DI), R14
    MOVL 4*7(DI), R15

    SUBQ $32, CX            // pre-subtract 32 bytes from CX
    JL scalar

vector: VMOVDQU (SI), Y0        // load 32 bytes from buf
    PREFETCHT0 384(SI)      // prefetch some data
    ADDQ $32, SI            // advance SI past them

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R15            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R14            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R13            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R12            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R11            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R10            // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R9         // add to counter
    VPADDD Y0, Y0, Y0       // shift Y0 left by one place

    VPMOVMSKB Y0, AX        // move MSB of Y0 bytes to AX
    POPCNTL AX, AX          // count population of AX
    ADDL AX, R8         // add to counter

    SUBQ $32, CX
    JGE vector          // repeat as long as bytes are left

scalar: ADDQ $32, CX            // undo last subtraction
    JE done             // if CX=0, there's nothing left

loop:   MOVBLZX (SI), AX        // load a byte from buf
    INCQ SI             // advance past it

    SHRL $1, AX         // CF=LSB, shift byte to the right
    ADCL $0, R8         // add CF to R8

    SHRL $1, AX
    ADCL $0, R9         // add CF to R9

    SHRL $1, AX
    ADCL $0, R10            // add CF to R10

    SHRL $1, AX
    ADCL $0, R11            // add CF to R11

    SHRL $1, AX
    ADCL $0, R12            // add CF to R12

    SHRL $1, AX
    ADCL $0, R13            // add CF to R13

    SHRL $1, AX
    ADCL $0, R14            // add CF to R14

    SHRL $1, AX
    ADCL $0, R15            // add CF to R15

    DECQ CX             // mark this byte as done
    JNE loop            // and proceed if any bytes are left

    // write R8--R15 back to counts
done:   MOVL R8, 4*0(DI)
    MOVL R9, 4*1(DI)
    MOVL R10, 4*2(DI)
    MOVL R11, 4*3(DI)
    MOVL R12, 4*4(DI)
    MOVL R13, 4*5(DI)
    MOVL R14, 4*6(DI)
    MOVL R15, 4*7(DI)

    VZEROUPPER          // restore SSE-compatibility
    RET
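
To call this from Go, the assembly routine needs a matching Go declaration without a body in the same package; a sketch is shown below (the package name and file names are assumptions; the _amd64 suffix restricts both files to amd64):

// pospopcnt_amd64.go (sketch) — Go-side stub for the assembly routine.
package pospopcnt

// PospopcntReg adds the positional population counts of buf to *counts.
// The implementation lives in pospopcnt_amd64.s.
//go:noescape
func PospopcntReg(counts *[8]int32, buf []byte)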

Positional Population Count 96 Bytes at a Time with CSA

This variant performs all of the optimisations above but reduces 96 bytes to 64 using a single CSA step beforehand. As expected, this improves the performance by roughly 30% and achieves up to 16 GB/s.
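
For reference, a carry-save adder compresses three inputs into a sum vector and a carry vector, where every set bit of the carry counts twice. A scalar Go sketch of that identity on uint64 words (only to illustrate the bit trick that the VPXOR/VPAND/VPOR sequence below applies to 32-byte vectors):

import "math/bits"

// csa is one carry-save adder step: per bit position, sum holds the low bit
// and carry the high bit of a+b+c, so that
// OnesCount(a)+OnesCount(b)+OnesCount(c) == OnesCount(sum) + 2*OnesCount(carry).
func csa(a, b, c uint64) (sum, carry uint64) {
    s := a ^ b
    carry = (a & b) | (s & c)
    sum = s ^ c
    return
}

// popcount3 counts the set bits of three words with one CSA step and two
// popcounts instead of three.
func popcount3(a, b, c uint64) int {
    sum, carry := csa(a, b, c)
    return bits.OnesCount64(sum) + 2*bits.OnesCount64(carry)
}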

#include "textflag.h"

// func PospopcntRegCSA(counts *[8]int32, buf []byte)
TEXT ·PospopcntRegCSA(SB),NOSPLIT,$0-32
    MOVQ counts+0(FP), DI
    MOVQ buf_base+8(FP), SI     // SI = &buf[0]
    MOVQ buf_len+16(FP), CX     // CX = len(buf)

    // load counts into register R8--R15
    MOVL 4*0(DI), R8
    MOVL 4*1(DI), R9
    MOVL 4*2(DI), R10
    MOVL 4*3(DI), R11
    MOVL 4*4(DI), R12
    MOVL 4*5(DI), R13
    MOVL 4*6(DI), R14
    MOVL 4*7(DI), R15

    SUBQ $96, CX            // pre-subtract 96 bytes from CX
    JL scalar

vector: VMOVDQU (SI), Y0        // load 96 bytes from buf into Y0--Y2
    VMOVDQU 32(SI), Y1
    VMOVDQU 64(SI), Y2
    ADDQ $96, SI            // advance SI past them
    PREFETCHT0 320(SI)
    PREFETCHT0 384(SI)

    VPXOR Y0, Y1, Y3        // first adder: sum
    VPAND Y0, Y1, Y0        // first adder: carry out
    VPAND Y2, Y3, Y1        // second adder: carry out
    VPXOR Y2, Y3, Y2        // second adder: sum (full sum)
    VPOR Y0, Y1, Y0         // full adder: carry out

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R15

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R14

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R13

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R12

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R11

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R10

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    VPADDB Y0, Y0, Y0       // shift carry out bytes left
    VPADDB Y2, Y2, Y2       // shift sum bytes left
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R9

    VPMOVMSKB Y0, AX        // MSB of carry out bytes
    VPMOVMSKB Y2, DX        // MSB of sum bytes
    POPCNTL AX, AX          // carry bytes population count
    POPCNTL DX, DX          // sum bytes population count
    LEAL (DX)(AX*2), AX     // sum popcount plus 2x carry popcount
    ADDL AX, R8

    SUBQ $96, CX
    JGE vector          // repeat as long as bytes are left

scalar: ADDQ $96, CX            // undo last subtraction
    JE done             // if CX=0, there's nothing left

loop:   MOVBLZX (SI), AX        // load a byte from buf
    INCQ SI             // advance past it

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R8         // add it to R8

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R9         // add it to R9

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R10            // add it to R10

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R11            // add it to R11

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R12            // add it to R12

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R13            // add it to R13

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R14            // add it to R14

    SHRL $1, AX         // is bit 0 set?
    ADCL $0, R15            // add it to R15

    DECQ CX             // mark this byte as done
    JNE loop            // and proceed if any bytes are left

    // write R8--R15 back to counts
done:   MOVL R8, 4*0(DI)
    MOVL R9, 4*1(DI)
    MOVL R10, 4*2(DI)
    MOVL R11, 4*3(DI)
    MOVL R12, 4*4(DI)
    MOVL R13, 4*5(DI)
    MOVL R14, 4*6(DI)
    MOVL R15, 4*7(DI)

    VZEROUPPER          // restore SSE-compatibility
    RET

Benchmarks

Here are benchmarks for the two algorithms and a naïve reference implementation in pure Go. Full benchmarks can be found in the github repository.

BenchmarkReference/10-12    12448764            80.9 ns/op   123.67 MB/s
BenchmarkReference/32-12     4357808           258 ns/op     124.25 MB/s
BenchmarkReference/1000-12            151173          7889 ns/op     126.76 MB/s
BenchmarkReference/2000-12             68959         15774 ns/op     126.79 MB/s
BenchmarkReference/4000-12             36481         31619 ns/op     126.51 MB/s
BenchmarkReference/10000-12            14804         78917 ns/op     126.72 MB/s
BenchmarkReference/100000-12            1540        789450 ns/op     126.67 MB/s
BenchmarkReference/10000000-12            14      77782267 ns/op     128.56 MB/s
BenchmarkReference/1000000000-12           1    7781360044 ns/op     128.51 MB/s
BenchmarkReg/10-12                  49255107            24.5 ns/op   407.42 MB/s
BenchmarkReg/32-12                  186935192            6.40 ns/op 4998.53 MB/s
BenchmarkReg/1000-12                 8778610           115 ns/op    8677.33 MB/s
BenchmarkReg/2000-12                 5358495           208 ns/op    9635.30 MB/s
BenchmarkReg/4000-12                 3385945           357 ns/op    11200.23 MB/s
BenchmarkReg/10000-12                1298670           901 ns/op    11099.24 MB/s
BenchmarkReg/100000-12                115629          8662 ns/op    11544.98 MB/s
BenchmarkReg/10000000-12                1270        916817 ns/op    10907.30 MB/s
BenchmarkReg/1000000000-12                12      93609392 ns/op    10682.69 MB/s
BenchmarkRegCSA/10-12               48337226            23.9 ns/op   417.92 MB/s
BenchmarkRegCSA/32-12               12843939            80.2 ns/op   398.86 MB/s
BenchmarkRegCSA/1000-12              7175629           150 ns/op    6655.70 MB/s
BenchmarkRegCSA/2000-12              3988408           295 ns/op    6776.20 MB/s
BenchmarkRegCSA/4000-12              3016693           382 ns/op    10467.41 MB/s
BenchmarkRegCSA/10000-12             1810195           642 ns/op    15575.65 MB/s
BenchmarkRegCSA/100000-12             191974          6229 ns/op    16053.40 MB/s
BenchmarkRegCSA/10000000-12             1622        698856 ns/op    14309.10 MB/s
BenchmarkRegCSA/1000000000-12             16      68540642 ns/op    14589.88 MB/s
fuz
  • You're amazing, the implementations are awesome, lightning fast. But my case is pospopcnt on column-wise byte arrays in a byte matrix, so I have to prepare a `[]byte` for each column before counting; however, pulling single bytes from the rows is very slow, with the `NOPL` instruction costing too much time. – shenwei356 Aug 05 '20 at 00:32
  • This turns into another problem: how to quickly transpose a byte matrix `[][]byte` in Golang assembly. – shenwei356 Aug 05 '20 at 01:32
  • After searching I found no assembly implementation, so I just used pure Go and carefully tuned the matrix row size, which also affects the performance (it decreases when > 512 or < 32). In the end your `PospopcntReg` brings a 2X speedup, thank you @fuz. – shenwei356 Aug 05 '20 at 02:29
  • Well, transpose is a whole new can of worms. It is possible to do that using scatter/gather operations, but it's not going to be fast. Consider changing the layout of your data structure if possible. – fuz Aug 05 '20 at 06:58
  • I created another post: https://stackoverflow.com/questions/63257822/fast-transpose-byte-matrix-byte-in-golang-assembly . And I've used a fixed row size (64) and 64×n columns in the original matrix, which is cache-friendly for transposing to `[][64]byte` and later passing it to pospopcnt. This improves the performance a little. – shenwei356 Aug 05 '20 at 08:08
  • PospopcntReg doesn't seem to support a slice with a single byte, because it processes the first byte and then the others. I want to apply it to every element of a byte matrix to avoid the transpose. – shenwei356 Aug 05 '20 at 16:28
  • @shenwei356 I cannot reproduce the issue. I have uploaded a new version of the source code adding a test case with a single byte. Can you provide a test case that demonstrates your issue? – fuz Aug 05 '20 at 20:56
  • Sorry, I meant a slice with only ONE byte. – shenwei356 Aug 05 '20 at 22:34
  • @shenwei356 I have tried that (see the test case I added). Please provide an example of the incorrect behaviour. – fuz Aug 05 '20 at 22:59
  • Sorry, I had just woken up and wasn't thinking clearly at the time. It was due to a bug in my code. I've just checked it again and am pretty sure your code is right. Also, after trying it, running pospopcnt on every byte of the matrix turned out to be a bad idea. – shenwei356 Aug 06 '20 at 00:00
  • @shenwei356 Yeah. The code I wrote works best with a multiple of 96 bytes in the slice. Everything with less than that won't be particularly fast. I have added further improvements to the github repository. – fuz Aug 06 '20 at 00:04
  • I see, that's amazing. I chose 64 bytes for being cache-friendly and to balance the time between the matrix transpose (more costly and affected by size) and pospopcnt. The data is stored in `[64]byte` and converted to a slice using `buf[:]`. – shenwei356 Aug 06 '20 at 00:24