
Edit: The last method described below is by far the fastest, giving a 5+ times speedup on smaller datasets; the problem was the benchmark itself, which incurred page faults and touched far more data than fits in cache.

My problem is the following: I have a large piece of data in memory. I want to generate the result of spreading out that data by a factor of three (what I'll call triplification, for lack of a better word), with the intermediate bits being 0. To be explicit, I want

11011011…

to be converted into

100100000100100000100100…

(where the convention is little-endian, so bit i of the input lands at bit 3i of the output). What I've tried:

  • Straightforwardly iterating over the bytes and using a 256-entry 8-bit lookup table, generating 24 bits at a time. Pretty fast since the table is small and probably hot in L1d. (~7.2 cycles / byte)
  • Using a 65536-entry 16-bit lookup table and unrolling the loop somewhat. (~5.01 c/byte)
  • Using a series of pdep (bit deposit) instructions by grabbing sets of 21 or 22 bits and performing bit deposition into a mask, each time generating 64 bits. (~4.63 c/byte)

I expected the last method to be significantly faster, but apparently it isn't. Even though the loop is heavily unrolled, the pdep instructions should execute in parallel (throughput of 1 per cycle, latency of 3 cycles), and shifts are fairly fast, it's bottlenecking somewhere else that I can't identify. Given the fairly regular form of the output, I thought there could be a clever bithack to do this (esp. on AVX2, where we have vpsllq/vpsllvq and brethren), but I couldn't think of a particularly good one. What is the fastest way to perform this operation (on large datasets, and assuming 32-byte alignment)?
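As Peter Cordes points out in a comment below, the related spread-by-2 problem does have a neat bithack: a carryless multiply of a value with itself squares it as a GF(2) polynomial, which sends bit i to bit 2i. Here is a minimal sketch, assuming PCLMULQDQ is available (which -march=haswell enables); spread2_64 is just a name I picked, and of course this only covers a factor of 2, not the factor of 3 asked about here.

#include <immintrin.h>
#include <stdint.h>

// Expand one 64-bit word into 128 bits with a 0 after every source bit:
// clmul(x, x) over GF(2) maps bit i to bit 2i (the cross terms cancel in pairs).
static inline __m128i spread2_64(uint64_t x) {
    __m128i v = _mm_cvtsi64_si128((long long)x);
    return _mm_clmulepi64_si128(v, v, 0x00); // low qword * low qword, carryless
}

// e.g. spread2_64(0xF) leaves 0x55 in the low qword: 1111 -> 01010101.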

I’m on Haswell, on a 2.7 GHz Quad-Core Intel Core i7, so I have access to AVX2 and BMI2, but no further. My code:

// gcc triple.c -o triple -march=haswell -g -O2
#include <immintrin.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h> // aligned_alloc
#include <limits.h>

#ifndef __AVX2__
#error "AVX2 needed"
#endif

#ifndef __BMI2__
#error "BMI2 needed"
#endif

const size_t SRC_SIZE = 1ULL << 30; // bytes

uint32_t lut[256];
// Probably too big to fit in L1d...
uint64_t lut16[65536];

// Reference implementation, working byte by byte (~7.200 cycles / byte)
void basic_stretch_data(void* src, size_t bytes, void* dst) {
    uint8_t* src_c = src;
    uint8_t* dst_c = dst;

    for (size_t i = 0; i < bytes; ++i) {
        uint32_t t = lut[src_c[i]]; // lut filled below

        dst_c[3*i]   = (t & 0xff);
        dst_c[3*i+1] = (t & 0xff00) >> 8;
        dst_c[3*i+2] = (t & 0xff0000) >> 16;
    }
}

// Attempts to be a bit more clever, working in 16-bit chunks and writing in 64-bit chunks (~5.01 cycles / byte).
// Expects 8-byte alignment.
void improved_stretch_data(void* src, size_t bytes, void* dst) {
    uint16_t* src_c = src;
    uint64_t* dst_c = dst;

    for (size_t i = 0; i < bytes / 2; i += 4) {
        // 8 input bytes -> 24 output bytes per iteration
        uint64_t t1 = lut16[src_c[i]]; // lut16 filled below
        uint64_t t2 = lut16[src_c[i+1]];
        uint64_t t3 = lut16[src_c[i+2]];
        uint64_t t4 = lut16[src_c[i+3]];

        dst_c[3*i/4]   = t1 | ((t2 & 0xffff) << 48);
        dst_c[3*i/4+1] = ((t2 & 0xffffffff0000) >> 16) | ((t3 & 0xffffffff) << 32);
        dst_c[3*i/4+2] = ((t3 & 0xffff00000000) >> 32) | (t4 << 16);
    }
}

// Every third bit set, starting at one of three offsets
#define BIT_MASK_1 0x9249249249249249ULL
#define BIT_MASK_2 (BIT_MASK_1 >> 1)
#define BIT_MASK_3 (BIT_MASK_1 >> 2)

// Expects 32-byte alignment. ~3.02 cycles / byte
void pdep_stretch_data(void* src, size_t bytes, void* dst) {
    uint64_t* src_c = src;
    uint64_t* dst_c = dst;

    uint64_t k;

    for (size_t i = 0; i < bytes / 8; ) {
        // Output 96 bytes per loop iteration

#define WRITE_PARTS \
        k = src_c[i]; \
        dst_c[3*i]   = _pdep_u64(k,       BIT_MASK_1); /* first 22 bits (little endian) */ \
        dst_c[3*i+1] = _pdep_u64(k >> 22, BIT_MASK_2); /* middle 21 bits */ \
        dst_c[3*i+2] = _pdep_u64(k >> 43, BIT_MASK_3); /* last 21 bits */ \
        i += 1;

        WRITE_PARTS
        WRITE_PARTS
        WRITE_PARTS
        WRITE_PARTS
    }
}

#define DEFINE_TRIPLIFY(name, IntType, Out) Out name(IntType b) { \
    Out s = 0; \
    for (int i = 0; i < sizeof(IntType) * CHAR_BIT; ++i) { \
        s |= (Out)(b & (1 << i)) << (2 * i); \
    } \
    return s; \
}

DEFINE_TRIPLIFY(triplify8, uint8_t, uint32_t)
DEFINE_TRIPLIFY(triplify16, uint16_t, uint64_t)

void fill_triplify_lut() {
    uint8_t b = 0;
    do {
        lut[b] = triplify8(b);
    } while (++b != 0);
}

void fill_triplify_lut16() {
    uint16_t b = 0;

    do {
        lut16[b] = triplify16(b);
    } while (++b != 0);
}

clock_t c;
void clock_start() {
    c = clock();
}

void clock_end(const char* msg) {
    printf("Clock %s took %f s\n", msg, (double)(clock() - c) / CLOCKS_PER_SEC);
}

int main() {
    fill_triplify_lut();
    fill_triplify_lut16();

    char* src = aligned_alloc(32, SRC_SIZE);
    char* dst = aligned_alloc(32, 3 * SRC_SIZE);

    printf("Initializing data\n");

    for (size_t i = 0; i < SRC_SIZE; ++i) {
        src[i] = i;
    }

    printf("Initialized data\n");

    // Stretch data so that each bit becomes three bits in the result, with two of them 0
    clock_start();
    pdep_stretch_data(src, SRC_SIZE, dst);
    clock_end("pdep finished");

    char* dst2 = aligned_alloc(32, 3 * SRC_SIZE);

    clock_start();
    basic_stretch_data(src, SRC_SIZE, dst2);
    clock_end("basic finished");

    // Validate
    for (size_t i = 0; i < 3 * SRC_SIZE; ++i) {
        if (dst[i] != dst2[i]) {
            printf("Results do not match at byte %zu. Expected: %i; got: %i.\n", i, dst2[i], dst[i]);
            return 1;
        }
    }
}

Disassembly

Second method:

improved_stretch_data:                  # @improved_stretch_data
        cmp     rsi, 2
        jb      .LBB1_3
        shr     rsi
        xor     r9d, r9d
        vmovdqa xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [16,32]
        vpxor   xmm1, xmm1, xmm1
        vmovdqa xmm2, xmmword ptr [rip + .LCPI1_1] # xmm2 = [32,16]
        xor     ecx, ecx
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
        movzx   r8d, word ptr [rdi + 2*rcx]
        movzx   eax, word ptr [rdi + 2*rcx + 2]
        mov     rax, qword ptr [8*rax + lut16]
        movzx   r10d, word ptr [rdi + 2*rcx + 4]
        vmovq   xmm3, rax
        shl     rax, 48
        or      rax, qword ptr [8*r8 + lut16]
        movzx   r8d, word ptr [rdi + 2*rcx + 6]
        vmovq   xmm4, qword ptr [8*r10 + lut16] # xmm4 = mem[0],zero
        vmovq   xmm5, qword ptr [8*r8 + lut16]  # xmm5 = mem[0],zero
        mov     qword ptr [rdx + 2*r9], rax
        vpunpcklqdq     xmm3, xmm3, xmm4        # xmm3 = xmm3[0],xmm4[0]
        vpsrlvq xmm3, xmm3, xmm0
        vpblendw        xmm3, xmm1, xmm3, 19            # xmm3 = xmm3[0,1],xmm1[2,3],xmm3[4],xmm1[5,6,7]
        vpunpcklqdq     xmm4, xmm4, xmm5        # xmm4 = xmm4[0],xmm5[0]
        vpsllvq xmm4, xmm4, xmm2
        vpor    xmm3, xmm4, xmm3
        vmovdqu xmmword ptr [rdx + 2*r9 + 8], xmm3
        add     rcx, 4
        add     r9, 12
        cmp     rcx, rsi
        jb      .LBB1_2
.LBB1_3:
        ret

Third method:

pdep_stretch_data:                      # @pdep_stretch_data
        push    r15
        push    r14
        push    rbx
        cmp     rsi, 8
        jb      .LBB2_3
        shr     rsi, 3
        mov     eax, 9
        xor     r15d, r15d
        movabs  r8, -7905747460161236407
        movabs  r9, 5270498306774157604
        movabs  r10, 2635249153387078802
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
        mov     rcx, qword ptr [rdi + 8*r15]
        pdep    r14, rcx, r8
        lea     r11, [8*rax]
        mov     qword ptr [rdx + 8*rax - 72], r14
        mov     rbx, rcx
        shr     rbx, 22
        pdep    rbx, rbx, r9
        mov     qword ptr [rdx + 8*rax - 64], rbx
        shr     rcx, 43
        pdep    rcx, rcx, r10
        mov     qword ptr [rdx + 8*rax - 56], rcx
        mov     rcx, qword ptr [rdi + 8*r15 + 8]
        pdep    rbx, rcx, r8
        mov     qword ptr [rdx + 8*rax - 48], rbx
        mov     rbx, rcx
        shr     rbx, 22
        pdep    rbx, rbx, r9
        mov     qword ptr [rdx + 8*rax - 40], rbx
        shr     rcx, 43
        pdep    rcx, rcx, r10
        mov     qword ptr [rdx + 8*rax - 32], rcx
        mov     rcx, qword ptr [rdi + 8*r15 + 16]
        pdep    rbx, rcx, r8
        mov     qword ptr [rdx + 8*rax - 24], rbx
        mov     rbx, rcx
        shr     rbx, 22
        pdep    rbx, rbx, r9
        mov     qword ptr [rdx + 8*rax - 16], rbx
        shr     rcx, 43
        pdep    rcx, rcx, r10
        mov     qword ptr [rdx + 8*rax - 8], rcx
        mov     rcx, qword ptr [rdi + 8*r15 + 24]
        pdep    rbx, rcx, r8
        mov     qword ptr [rdx + 8*rax], rbx
        mov     rbx, rcx
        shr     rbx, 22
        pdep    rbx, rbx, r9
        mov     qword ptr [rdx + 8*rax + 8], rbx
        shr     rcx, 43
        pdep    rcx, rcx, r10
        or      r11, 16
        mov     qword ptr [rdx + r11], rcx
        add     r15, 4
        add     rax, 12
        cmp     r15, rsi
        jb      .LBB2_2
.LBB2_3:
        pop     rbx
        pop     r14
        pop     r15
        ret
  • `_mm256_set_epi64x` isn't fast! Since you're running less than 1 `pdep` per clock anyway, probably best to just do scalar stores. (Use `memcpy` to make it strict-aliasing / alignment safe, if necessary.) Yeah, the asm you show has 4x `vmovq xmm, reg` and 2x `vpunpcklqdq` all competing for port 5 on your Haswell (https://uops.info/). (Ice Lake added another shuffle unit that can handle some shuffles, but it's on port 1 where it would compete with pdep.) Luckily for you, GCC did the store in two XMM halves instead of another shuffle (`vinserti128`). – Peter Cordes May 15 '22 at 18:35
  • BTW, stretching bits by a factor of *2* can be done with carryless multiply like `vpclmulqdq xmm1, xmm0, xmm0, 0` as in [The interstice of two binary numbers](https://codegolf.stackexchange.com/a/245585) to get 128 bits from 64 bits. But I don't know a way to spread by 3. – Peter Cordes May 15 '22 at 18:46
  • Done. Didn't know _mm256_set_epi64x compiled to a sequence, but in hindsight that makes a lot more sense. The timing doesn't seem to have measurably changed, though. – Ovinus Real May 15 '22 at 18:47
  • Yeah, I saw that for factor of 2. Makes quite a bit of sense. The ultimate goal for me is to be able to rapidly generate this data for any arithmetic progression ax+b (where here (a,b)=(3,0)) with a,b <= 16, roughly. a=2^k seems fairly straightforward, but a=3 is the first frustrating case. – Ovinus Real May 15 '22 at 18:49
  • If this could be made fast enough, it's something you could do on the fly instead of making a less-compact copy of the same data. Or at least as you output something else. Your benchmark uses huge arrays, and may even be touching them for the first time inside your timed regions. (So you're timing page faults; that would explain not getting a speedup from changing `_mm_set` to just scalar stores. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes May 15 '22 at 18:56
  • Good point; and indeed, the storage is solely for verifying correctness. When I'm back on my computer I'll try 1. touching the memory first and 2. removing the stores altogether. Does memset work to do the former, or is it handled specially? – Ovinus Real May 15 '22 at 19:00
  • Yeah, reducing to 1MiB instead of 1GiB and slapping a 1024 iter repeat loop around the function calls is the same amount of work, but sped it up pdep from 1.13s to 0.144s on my 3.9GHz Skylake. Previously 1.13 vs. 1.66s, now 0.144 vs. 0.93 s. So it's a huge speedup, and a big *relative* gain vs. the LUT. https://godbolt.org/z/Pb4nn83aE is the code I timed, using `__attribute__((noinline,noipa))` to make it safe to use a repeat loop without inlining and hoisting defeating the benchmark. 1024 repeat iters amortizes the page faults sufficiently; ideally touch the output arrays first. – Peter Cordes May 15 '22 at 19:00
  • memset does work to touch the memory first, unless possibly a `0` value gets optimized away. (GCC will optimize malloc + memset(0) into calloc, but I don't think there's a version of `aligned_alloc` that guarantees zeroed memory, ironically since the use-case is often big arrays so it's likely just going straight to the kernel). But 1GiB is still huge and doesn't fit in L3 cache, so it's much better to loop over a smaller array for perf testing. Verification can be separate. (And could be done in chunks that fit in cache, to make exhaustive testing for 32-bit bit-patterns efficient.) – Peter Cordes May 15 '22 at 19:08
  • Fascinating stuff, thanks a ton. I'm getting similar results for 1 MiB as well. Why does L3 cache matter with huge amounts of data? After using memset, with such a regular access pattern, shouldn't hardware prefetching make that a non-issue? – Ovinus Real May 15 '22 at 19:22
  • The major effect here is that page faults from the first access to a virtual memory page are amortized over 1024 passes, instead of 1, making the cost negligible even without bothering with an init loop to touch the memory before the timed region. If we *did* pre-fault the memory, it might well run almost as fast with 1 pass over a huge array as 1024 passes over small arrays, but possibly be more sensitive to memory bandwidth competition from anything else on the system. – Peter Cordes May 15 '22 at 19:29
  • Picking sizes that fit in L3 or L2 cache for a loop like this makes sure memory bandwidth isn't a bottleneck, since we want to optimize an ALU strategy. Non-NT stores are effectively a read (RFO) + write, and this could possibly come close to 1 pdep/clock, storing 8 bytes per clock cycle, or 32GB/s at 4GHz of actual stores, or 64GB/s of total bandwidth including RFOs. And that's not counting reading the input array. Even DDR4-3200 isn't that fast, let alone Haswell DDR3. Even coming close to a memory bottleneck can sometimes stall the pipeline, costing some throughput. – Peter Cordes May 15 '22 at 19:36
  • If you had a slower function, then yeah HW prefetch would keep up and avoid memory bottlenecks for the loads and stores, giving L1d hits for loads. At worst maybe limit how far ahead OoO exec could see if store-buffer entries don't get freed as quickly. But with a fast loop, it's easily possible to run faster than dual-channel DRAM can keep up with on a desktop, or than the single-core memory bandwidth on a big Xeon. – Peter Cordes May 15 '22 at 19:41
  • If all the stores were removed, do you think it would keep up? In any case, I'm pretty sure the pdep method is plenty fast enough for my purposes, so I'll edit my question to reflect that. – Ovinus Real May 15 '22 at 19:45
  • If all the stores were removed *in the asm*, then reading 8 bytes per clock cycle at 2.7GHz is 21.7GB/s. That's maybe doable. But big arrays can also have TLB misses. Of course, if all the assignments to the output were removed in C, the work would optimize away and you'd be timing an empty loop, or no loop. I don't get why you don't want to just loop over an array that fits in cache. – Peter Cordes May 15 '22 at 19:49
  • For the benchmark, of course, and my question is answered. For the actual application, I'm trying to merge several streams of this type of data (in various arithmetic progressions) with a vertical bitwise OR, so I'm just trying to get a sense of whether memory will ultimately be the bottleneck (at which point I can't do much). I think it probably will be, but that's not super relevant to this particular post. – Ovinus Real May 15 '22 at 19:54
  • Just for fun, I timed a version where `pdep_stretch_data` just copied `dst_c[3*i+0] = dst_c[3*i+1] = dst_c[3*i+2] = src_c[i]`, same memory access pattern, but zero work. It ran the same speed as your current version, with a 1MiB size. Lowering it to `1ULL<<13` or smaller (thus fitting in L1d) with a repeat count of `1ULL<<17`, sped up from 0.144s to 0.045 s on my Skylake CPU, for the no-work just replicate qwords version. Even L1d vs. L2 cache made a difference here (0.112s for size = 1<<14), and SKL has a faster L1d <-> L2 connection than Haswell. – Peter Cordes May 15 '22 at 20:00
  • Ok, for your real use-case, it might be a significant win to make some versions of a loop that can read two streams in a pattern that keeps one output aligned, so you're calculating in registers and only storing OR results, not re-reading streams later even if that's with SIMD. Like 2x3 outputs from pdep, and 3x2 outputs from pclmul to expand by 2. Possibly then `_mm_set_epi64x()` and `_mm_or_si128` to use the clmul results. If you don't go that far, at least work in chunks (cache blocking) so you're reading back results from L1d cache, so you can do 2x 32-byte loads per clock. – Peter Cordes May 15 '22 at 20:05
  • Letting these individually expanded temporary arrays go to DRAM and back would be horrible, and even L3 and back isn't great. – Peter Cordes May 15 '22 at 20:06
  • Indeed, in the full application there's no temporary writes; I unrolled the loop to the lcm of the progressions, so everything is nicely aligned. As an aside, it's interesting how versatile of an instruction pdep is, in the sense that I have no clue how to efficiently emulate it for even slightly unusual mask values. Somewhat annoying because the computer I usually use is pre-Haswell and doesn't even have AVX2, let alone BMI2, so I'm using my laptop for this project.... – Ovinus Real May 15 '22 at 20:10
  • Oops, my earlier test numbers for just copying, no pdep, were from GCC auto-vectorizing with a shuffle. With `gcc -O3 -march=skylake -funroll-loops -fno-tree-vectorize`, scalar code that doesn't suffer L1d misses (size=1<<12) runs 0.104 seconds, about 1.5x faster than L3 hits. But with size=1<<15 (32kiB, so total cache footprint 128KiB), I still get a slowdown, running at 0.125 s for the same total amount of work. – Peter Cordes May 15 '22 at 20:11
  • Ok, yeah if you're avoiding temporary writes, then it's also relevant how it interleaves with other code, e.g. if the bottleneck is front-end, or some back-end port. Timing it on its own writing nowhere can be interesting but doesn't tell you how well the two pieces of work are going to overlap with each other. – Peter Cordes May 15 '22 at 20:16
  • BTW, you *can* get a C compiler to do the PDEP work without storing, using inline asm to force it to materialize a value in a register. Like `asm volatile("" : "r"(val));` A `"g"` constraint would also allow memory or immediate, but clang unfortunately always picks memory when you let it. See [Preventing compiler optimizations while benchmarking](https://stackoverflow.com/q/40122141) / [Google Benchmark Frameworks DoNotOptimize](//stackoverflow.com/q/66795357) / [I don't understand the definition of DoNotOptimizeAway](//stackoverflow.com/q/52203710) for similar versions of the same idea. – Peter Cordes May 15 '22 at 20:20
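For reference, here is a minimal, self-contained sketch of the benchmarking setup described in the comments above: a buffer small enough to stay in cache, memory touched before the timed region, a repeat loop so the one-time page-fault cost is amortized, noinline/noipa so the repeat loop can't be defeated by inlining, and an empty asm statement as an optimization sink. BUF_SIZE, REPS, and the un-unrolled pdep loop are my own simplifications, not the code from the linked Godbolt.

// gcc bench.c -o bench -march=haswell -O2
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BIT_MASK 0x9249249249249249ULL
#define BUF_SIZE (1ULL << 20) /* 1 MiB of input */
#define REPS     1024

__attribute__((noinline, noipa)) /* noipa is GCC-specific; keeps the work from being hoisted out of the repeat loop */
static void pdep_stretch(const void* src, size_t bytes, void* dst) {
    const uint64_t* s = src;
    uint64_t* d = dst;
    for (size_t i = 0; i < bytes / 8; ++i) {
        uint64_t k = s[i];
        d[3*i]   = _pdep_u64(k,       BIT_MASK);
        d[3*i+1] = _pdep_u64(k >> 22, BIT_MASK >> 1);
        d[3*i+2] = _pdep_u64(k >> 43, BIT_MASK >> 2);
    }
}

int main(void) {
    uint8_t* src = aligned_alloc(32, BUF_SIZE);
    uint8_t* dst = aligned_alloc(32, 3 * BUF_SIZE);
    memset(src, 0x5a, BUF_SIZE);  /* pre-fault the input pages */
    memset(dst, 1, 3 * BUF_SIZE); /* pre-fault the output pages */

    clock_t c0 = clock();
    for (int r = 0; r < REPS; ++r) {
        pdep_stretch(src, BUF_SIZE, dst);
        asm volatile("" :: "r"(dst) : "memory"); /* optimization sink, per the DoNotOptimize links */
    }
    printf("pdep: %f s for %d passes\n", (double)(clock() - c0) / CLOCKS_PER_SEC, REPS);
    return 0;
}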

0 Answers