15

I need to make a uint64_t out of two uint32_t values by interleaving the bits: if A=a0a1a2...a31 and B=b0b1...b31, I need C=a0b0a1b1...a31b31. Is there a way to do this efficiently? So far I only have the naive approach with a for loop of 32 iterations, where each iteration does `C |= ((A & (1 << i)) << i) | ((B & (1 << i)) << (i + 1))`.
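
For reference, a compilable version of that naive loop (with explicit casts to uint64_t before shifting, so bits shifted past position 31 aren't lost):

uint64_t interleave_naive(uint32_t A, uint32_t B)
{
    uint64_t C = 0;
    for (int i = 0; i < 32; ++i) {
        // Widen each extracted bit to 64 bits before shifting it into place.
        C |= ((uint64_t)(A & (1u << i)) << i)
           | ((uint64_t)(B & (1u << i)) << (i + 1));
    }
    return C;
}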

I guess there should be some mathematical trick, like multiplying A and B by some special number that interleaves their bits with zeros in the resulting 64-bit number, so that all that remains is to OR the two products together. But I can't find such a multiplier.

Another potential way to go is a compiler intrinsic or assembly instruction, but I don't know of one.

Rakete1111
Serge Rogatch
  • 5
    [This](https://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN) looks like it may work. Not sure if or how it scales though. – NathanOliver Sep 14 '16 at 12:25
  • It should scale fairly easily; add an initial shift by 16 and mask by `0x0000ffff0000ffff`, and extend all the other masks appropriately. So the new first step severs and moves the high 16 bits into the low part of the topmost word. – Tommy Sep 14 '16 at 12:40
  • 5
    Can you use x86's BMI2 instruction set? Two 64-bit [PDEP with opposite masks](http://www.felixcloutier.com/x86/PDEP.html) and one ordinary bitwise-OR should do the trick. The intrinsic for PDEP is `_pdep_u64()` – Peter Cordes Sep 14 '16 at 13:50
  • 1
    @PeterCordes, that's a superior theoretical solution, but in practice I have BMI2/PDEP neither on a Phenom II processor nor in CUDA. – Serge Rogatch Sep 14 '16 at 18:47

4 Answers

13

NathanOliver's link offers the 16-bit -> 32-bit implementation:

static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
static const unsigned int S[] = {1, 2, 4, 8};

unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
unsigned int y; // are in the even positions and bits from y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.  
                // x and y must initially be less than 65536.

x = (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];

y = (y | (y << S[3])) & B[3];
y = (y | (y << S[2])) & B[2];
y = (y | (y << S[1])) & B[1];
y = (y | (y << S[0])) & B[0];

z = x | (y << 1);

Which works by:

  1. leaving the low 8 bits of x where they are and moving the high 8 bits up by 8;
  2. dividing in half and doing the same thing, this time leaving the low 4 bits of each 8-bit group where they are and moving the high 4 bits up by 4;
  3. and again, and again.

I.e. it proceeds as:

   0000 0000 0000 0000  abcd efgh ijkl mnop
-> 0000 0000 abcd efgh  0000 0000 ijkl mnop
-> 0000 abcd 0000 efgh  0000 ijkl 0000 mnop
-> 00ab 00cd 00ef 00gh  00ij 00kl 00mn 00op
-> 0a0b 0c0d 0e0f 0g0h  0i0j 0k0l 0m0n 0o0p

And then combines the two inputs together.

As per my earlier comment, to extend that to 64 bits, just add an initial shift by 16 and mask by 0x0000ffff0000ffff, either because you can intuitively follow the pattern or as a divide-and-conquer step, turning the 32-bit problem into two non-overlapping 16-bit problems and then using the 16-bit solution.
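
Spelled out, a sketch of that 64-bit extension (the function name interleave64 is mine; it is the same pattern with the extra shift-by-16 step, and with the masks written inline rather than in arrays):

uint64_t interleave64(uint32_t a, uint32_t b)
{
    uint64_t x = a, y = b;   // a ends up in the even bits, b in the odd bits

    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x <<  8)) & 0x00FF00FF00FF00FFull;
    x = (x | (x <<  4)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x <<  2)) & 0x3333333333333333ull;
    x = (x | (x <<  1)) & 0x5555555555555555ull;

    y = (y | (y << 16)) & 0x0000FFFF0000FFFFull;
    y = (y | (y <<  8)) & 0x00FF00FF00FF00FFull;
    y = (y | (y <<  4)) & 0x0F0F0F0F0F0F0F0Full;
    y = (y | (y <<  2)) & 0x3333333333333333ull;
    y = (y | (y <<  1)) & 0x5555555555555555ull;

    return x | (y << 1);
}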

Jan Schultke
Tommy
  • The divide-and-conquer approach doubles the number of required operations, while extension with a 5th array item and `or-and-assign`, if it works, increases the operation count by 1.25x. – Serge Rogatch Sep 14 '16 at 17:37
  • Divide and conquer doesn't double in this case because the 32-bit value, divided into two 16-bit quantities, fits in a 64-bit word throughout, which your CPU can act upon atomically. But we're arguing semantics. I count it as logically an initial division, even though the divided thing continues to fit inside one scalar. – Tommy Sep 14 '16 at 18:21
  • Thanks, I misunderstood initially. I thought you meant performing these 4 `or-and-assign` operations first on the lower 32 bits, then on the higher 32 bits. – Serge Rogatch Sep 14 '16 at 18:42
  • 1
    @J.Schultke The question was already tagged `[c++]` so that's the default code formatting for unmarked blocks, without explicit language comments or backtick language hints. I don't think your edit changed the syntax highlighting of the first code block at all. The other minor changes were improvements so not rolling it back, but in future you can make your edits less noisy by keeping that in mind and looking for actual bad highlighting. – Peter Cordes Aug 21 '20 at 00:26
11

For larger integers, it's worth mentioning the CLMUL x86 extension for finite-field multiplication (carryless multiplication). Interleaving an integer's bits with zeros is equivalent to a carryless multiplication of the integer by itself, which is a single ALU instruction.
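
A minimal scalar sketch of that identity (function names are mine; this assumes x86-64 with PCLMUL support and compiling with e.g. -mpclmul; the last answer below gives a full, vectorized implementation):

#include <stdint.h>
#include <immintrin.h>

// Spread the bits of v apart with zeros: bit k of v lands at bit 2k.
// Carryless v*v does exactly this, because the cross terms cancel in
// pairs under XOR and only the "square" terms at positions 2k survive.
static inline uint64_t spread_bits_clmul(uint32_t v)
{
    __m128i x  = _mm_cvtsi32_si128((int)v);
    __m128i sq = _mm_clmulepi64_si128(x, x, 0x00);
    return (uint64_t)_mm_cvtsi128_si64(sq);
}

static inline uint64_t interleave_one_pair(uint32_t a, uint32_t b)
{
    return spread_bits_clmul(a) | (spread_bits_clmul(b) << 1);
}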

saolof
  • Good choice, and with 64-bit input / 128-bit output it's even pretty much ideal. But it's also worth noting that this is *currently* still a somewhat "new" extension and will not be available on AMD Phenom II and Intel's Ivy Bridge processors, which you may still encounter in the wild. – Ext3h Jul 29 '20 at 19:57
  • @Ext3h: https://www.felixcloutier.com/x86/pclmulqdq is available since Westmere for Intel (2nd-gen Nehalem). https://godbolt.org/z/jWaGWv. But yes, it's an extension, not baseline x86-64. – Peter Cordes Jul 29 '20 at 20:00
  • 1
    Since we're mentioning x86 extensions, BMI2 [`pdep r64, r64, r/m64`](https://www.felixcloutier.com/x86/pdep) is efficient on Intel (1 uop, 3c latency) but not AMD. It can expand a 32-bit to a 64-bit integer with zeros in between. (BMI2 is available on Haswell and later, and on AMD Zen and later, but `pdep`/`pext` are very slow on Zen) – Peter Cordes Jul 29 '20 at 20:37
3

Would a short, precalculated array lookup count as a "mathematical trick"?

Precalculate an array of 256 uint16_ts:

static const uint16_t lookup[256]={0x0000, 0x0001, 0x0004, 0x0005, ..., 0x5555};

We can then interleave two eight-bit values and come up with a 16-bit value easily:

uint16_t interleave(uint8_t a, uint8_t b)
{
    return (lookup[a] << 1) | lookup[b];
}

How to extend this to interleave two 32-bit values into a 64-bit value should be obvious: call this four times, once for each of the four bytes that make up a uint32_t, then `<<` and `|` the results together. Bribe the compiler to inline the whole thing, and the end result should be fairly quick and cheap.
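
For instance, a sketch of that extension (the helper name interleave32 is mine; it assumes the interleave() and lookup[] above):

uint64_t interleave32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    // Interleave byte i of a with byte i of b; that 16-bit piece
    // belongs at bit position 16*i of the result.
    for (int i = 0; i < 4; ++i) {
        uint16_t piece = interleave((uint8_t)(a >> (8 * i)),
                                    (uint8_t)(b >> (8 * i)));
        result |= (uint64_t)piece << (16 * i);
    }
    return result;
}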

Since RAM is cheap these days, you might want to consider a precalculated table of 65536 uint32_ts, also.

Sam Varshavchik
  • 12
    RAM is cheap; cache misses aren't. Most programs are already bottlenecked on memory more than CPU. This might win a microbenchmark where no other code uses up your cache footprint, but it won't win in real life unless this operation happens in bursts in a really tight loop. – Peter Cordes Sep 14 '16 at 13:57
  • 1
    It's still a good idea in some cases. I need to interleave two bytes, but doing it through 2 lookups of the upper and lower 4-bit nibbles of each byte might be worth it. – Mads Y Jan 03 '20 at 14:02
  • @MadsY: Indeed, 4-bit LUTs (16x 1-byte entries) are small enough to actually stay hot in cache reliably, and can be a good tradeoff between memory and computation (AND + shift to isolate each nibble, OR to combine). – Peter Cordes Jul 29 '20 at 19:49
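
A minimal sketch of that nibble-LUT variant (the table name spread4 is mine; 16 one-byte entries, each holding the 4-bit index spread to even bit positions):

#include <stdint.h>

static const uint8_t spread4[16] = {
    0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
    0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
};

static inline uint16_t interleave_bytes(uint8_t a, uint8_t b)
{
    // Two lookups per input byte: one per nibble, then shift into place.
    uint16_t lo = (uint16_t)((spread4[a & 0x0F] << 1) | spread4[b & 0x0F]);
    uint16_t hi = (uint16_t)((spread4[a >> 4]   << 1) | spread4[b >> 4]);
    return (uint16_t)(lo | (hi << 8));
}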
3

To make saolof's answer concrete, the following is an implementation using the CLMUL instruction set, interleaving two pairs of uint32_ts per call:

#include <immintrin.h>
#include <stdint.h>

typedef struct {
  uint32_t x;
  uint32_t y;
} uint32_2;

static inline void interleave_clmul(uint32_2 *input, uint64_t *out) {
  __m128i xy = _mm_load_si128((const __m128i *)input);

  // Rearrange the 32-bit lanes from {x0, y0, x1, y1} to {x0, x1, y0, y1}:
  // the low qword then holds both x values and the high qword both y values,
  // ready for the two carryless squares below.
  xy = _mm_shuffle_epi32(xy, 0b11011000);

  // Two carryless multiplies
  __m128i p2 = _mm_clmulepi64_si128(xy, xy, 0x11);
  __m128i p1 = _mm_clmulepi64_si128(xy, xy, 0x00);

  // Bitwise interleave the results
  p2 = _mm_slli_epi16(p2, 1);
  __m128i p3 = _mm_or_si128(p1, p2);

  _mm_storeu_si128((__m128i *)out, p3);
}

That compiles down to the following:

interleave_clmul(uint32_2*, unsigned long*):
        vpshufd         xmm0, xmmword ptr [rdi], 216    # xmm0 = mem[0,2,1,3]
        vpclmulqdq      xmm1, xmm0, xmm0, 17
        vpclmulqdq      xmm0, xmm0, xmm0, 0
        vpaddw          xmm1, xmm1, xmm1
        vpor            xmm0, xmm0, xmm1
        vmovdqu         xmmword ptr [rsi], xmm0
        ret

Replace _mm_load_si128 with _mm_loadu_si128 if your data is not aligned; unaligned loads aren't much slower on x86 anyway. This version is faster than the corresponding implementation with pdep instructions in terms of throughput:

Naive
Total rdtsc: 1295559857, iterations: 1000, count: 10000
clmul-based
Total rdtsc: 17751716, iterations: 1000, count: 10000
pdep-based
Total rdtsc: 26123417, iterations: 1000, count: 10000
pdep-based unrolled
Total rdtsc: 24281811, iterations: 1000, count: 10000

Turbo boost was disabled; CPU is a 1.60 GHz base clock Kaby Lake. Results seem consistent across runs. (Results on other architectures would be nice.) The (somewhat messy) testing code:


#include <stdio.h>
#include <inttypes.h>
#include <string.h>
#include <stdlib.h>
#include <immintrin.h>
// rdtscp
#include <x86intrin.h>

typedef struct uint32_2 {
    uint32_t x;
    uint32_t y;
} uint32_2;


uint32_2* generate_pairs(const int count) {
    uint32_2* p = malloc(count * sizeof(uint32_2));

    uint32_t r = 401923091;
#define RNG r *= 52308420; r += 2304;

    for (int i = 0; i < count; ++i) {
        p[i].x = r;
        RNG RNG
        p[i].y = r;
        RNG RNG
    }

    return p;
}

void interleave_naive(uint64_t* dst, uint32_2* src, int count) {
    for (int i = 0; i < count; ++i) {
        struct uint32_2 s = src[i];
        uint32_t x = s.x, y = s.y;

        uint64_t result = 0;
        for (int k = 0; k < 32; ++k) {
            if (x & ((uint32_t)1 << k)) {
                result |= (uint64_t)1 << (2 * k);
            }

            if (y & ((uint32_t)1 << k)) {
                result |= (uint64_t)1 << (2 * k + 1);
            }
        }

        dst[i] = result;
    }
}

void interleave_pdep(uint64_t* dst, uint32_2* src, int count) {
    for (int i = 0; i < count; ++i) {
        struct uint32_2 s = src[i];
        uint32_t x = s.x, y = s.y;

        uint64_t result = _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);

        dst[i] = result;
    }
}

void interleave_pdep_unrolled(uint64_t* dst, uint32_2* src, int count) {
    for (int i = 0; i < count; i += 2) {
        struct uint32_2 s1 = src[i];
        struct uint32_2 s2 = src[i + 1];

        uint32_t x1 = s1.x, y1 = s1.y;
        uint32_t x2 = s2.x, y2 = s2.y;

        uint64_t result1 = _pdep_u64(x1, 0x5555555555555555) | _pdep_u64(y1, 0xaaaaaaaaaaaaaaaa);
        uint64_t result2 = _pdep_u64(x2, 0x5555555555555555) | _pdep_u64(y2, 0xaaaaaaaaaaaaaaaa);

        dst[i] = result1;
        dst[i + 1] = result2;
    }
}

void interleave_clmul(uint64_t* dst, uint32_2* src, int count) {
    uint32_2* end = src + count;
    uint64_t* out = dst;

    for (uint32_2* p = src; p < end; p += 2, out += 2) {
        __m128i xy = _mm_load_si128((const __m128i *) p);

        xy = _mm_shuffle_epi32(xy, 0b11011000);

        __m128i p2 = _mm_clmulepi64_si128(xy, xy, 0x11);
        __m128i p1 = _mm_clmulepi64_si128(xy, xy, 0x00);

        p2 = _mm_slli_epi16(p2, 1);
        __m128i p3 = _mm_or_si128(p1, p2);

        _mm_store_si128((__m128i *)out, p3);
    }
}

#define ITERATIONS 1000

void time_inv(uint32_2* pairs, int count, void (*interleave) (uint64_t*, uint32_2*, int)) {
    uint64_t* result = malloc(count * sizeof(uint64_t));
    uint64_t* reference_result = malloc(count * sizeof(uint64_t));

    interleave_naive(reference_result, pairs, count);
    // Induce page faults
    memset(result, 0, count * sizeof(uint64_t));

    unsigned _;
    int64_t start_rdtsc = __rdtscp(&_);
    for (int i = 0; i < ITERATIONS; ++i) {
        interleave(result, pairs, count);
    }
    int64_t end_rdtsc = __rdtscp(&_);

    for (int i = 0; i < count; ++i) {
        if (reference_result[i] != result[i]) {
            fprintf(stderr, "Incorrect value at index %d; got %" PRIx64 ", should be %" PRIx64 "\n", i, result[i], reference_result[i]);
            abort();
        }
    }

    printf("Total rdtsc: %" PRId64 ", iterations: %d, count: %d\n", end_rdtsc - start_rdtsc, ITERATIONS, count);

    free(result);
    free(reference_result);
}

int main() {
    const int count = 10000;
    uint32_2* pairs = generate_pairs(count);

    printf("Naive\n");
    time_inv(pairs, count, interleave_naive);
    printf("clmul-based\n");
    time_inv(pairs, count, interleave_clmul);
    printf("pdep-based\n");
    time_inv(pairs, count, interleave_pdep);
    printf("pdep-based unrolled\n");
    time_inv(pairs, count, interleave_pdep_unrolled);

    free(pairs);
}

Compile with gcc bleh.c -o bleh -O2 -march=native.

Perf stats for each implementation are below. CLMUL seems to do 1.5c/pair, bottlenecked by port-5 contention between the 2 pclmulqdq and 1 vpshufd per iteration, while the pdep implementations bottleneck on port 1, on which pdep executes, resulting in 2c/pair:

clmul-based
Total rdtsc: 1774895925, total milliseconds: 985.575000, iterations: 100000, count: 10000

 Performance counter stats for './interleave':

       556,602,052      uops_dispatched_port.port_0                                     (49.87%)
     1,556,592,314      cycles                                                        (49.86%)
       469,021,017      uops_dispatched_port.port_1                                     (49.86%)
       472,968,452      uops_dispatched_port.port_2                                     (50.08%)
       519,804,531      uops_dispatched_port.port_3                                     (50.13%)
       499,980,587      uops_dispatched_port.port_4                                     (50.14%)
     1,509,928,584      uops_dispatched_port.port_5                                     (50.14%)
     1,484,649,884      uops_dispatched_port.port_6                                     (49.92%)

pdep-based
Total rdtsc: 2588637876, total milliseconds: 1438.065000, iterations: 100000, count: 10000

 Performance counter stats for './interleave':

       745,844,862      uops_dispatched_port.port_0                                     (50.02%)
     2,289,048,624      cycles                                                        (50.02%)
     2,033,116,738      uops_dispatched_port.port_1                                     (50.02%)
     1,508,870,090      uops_dispatched_port.port_2                                     (50.02%)
     1,498,920,409      uops_dispatched_port.port_3                                     (49.98%)
     1,056,089,339      uops_dispatched_port.port_4                                     (49.98%)
       843,399,033      uops_dispatched_port.port_5                                     (49.98%)
     1,414,062,891      uops_dispatched_port.port_6                                     (49.98%)

pdep-based unrolled
Total rdtsc: 2387027127, total milliseconds: 1325.857000, iterations: 100000, count: 10000

 Performance counter stats for './interleave':

       532,577,450      uops_dispatched_port.port_0                                     (49.64%)
     2,099,782,071      cycles                                                        (49.94%)
     2,004,347,972      uops_dispatched_port.port_1                                     (50.24%)
     1,532,203,395      uops_dispatched_port.port_2                                     (50.54%)
     1,467,988,364      uops_dispatched_port.port_3                                     (50.36%)
     1,701,095,132      uops_dispatched_port.port_4                                     (50.06%)
       543,597,866      uops_dispatched_port.port_5                                     (49.76%)
       930,460,812      uops_dispatched_port.port_6                                     (49.46%)
Ovinus Real
  • It's non-obvious that CLMUL would be faster on Haswell; it's 3 uops there, with 2c throughput (`2*p0+1*p5` - https://uops.info/) to produce 128 bits of output vs. `pdep` being 1 uop with 1c throughput (3c latency) to produce 64 bits. (And an `lea` to interleave, if you write the source with `pdep(x,mask) + (pdep(y,mask)<<1)`. For `|` to be fast, you'd need 2 different pdep constants to avoid a shift). For a single pair of 32-bit inputs, we only need a 64-bit constant, and 2x `pdep` + `lea`. For a big array, I'd expect about equal. – Peter Cordes Jul 20 '22 at 00:54
  • CLMUL is great on Broadwell and later (1 uop, 1c throughput), or 3 uops (2c throughput) for the YMM/ZMM version on Ice Lake. Then down to 1 uop again for the YMM/ZMM version on Alder Lake. On AMD, it's 4 uops (2c throughput not fully pipelined) for the XMM version on Zen1/2/3 (and the YMM version which is new in Zen3). https://uops.info/. But `pdep` is disastrously slow on Zen 1/2, only having proper hw support in Zen3. – Peter Cordes Jul 20 '22 at 01:04
  • @PeterCordes I encourage you to benchmark it; I don't remember this answer well and I might have only used _pdep_u32, which would be unfortunate. – Ovinus Real Jul 20 '22 at 03:25
  • I don't have a Haswell to benchmark on, only a Skylake. But `pdep` performance is the same on SKL, so I could look at cycles per pair. 2 cycles per pair of uint32_t inputs would be the theoretical best case. (2x load + 2x pdep + lea + store is 6 uops even without loop overhead, but has 2c throughput so there's room for loop overhead.) 2.27 sounds reasonable in practice. – Peter Cordes Jul 20 '22 at 03:37
  • What's surprising is your reported `pclmul` throughput: you need 2x pclmul per iter (2 pairs), which on Haswell *should* bottleneck you at one iter (2 pairs) per 4 cycles, or 1 pair per 2 cycles. I don't see how 0.9 is plausible. – Peter Cordes Jul 20 '22 at 03:39
  • Hm. Turns out Daniel Lemire's harness code just uses rdtsc/rdtscp, which iirc is not necessarily in step with actual cycle count. I'll have to retest the program with perf counters at some point. – Ovinus Real Jul 20 '22 at 03:40
  • Ah, yeah I was just about to comment that perhaps non-integral counts were due to measuring reference cycles, not core clock cycles. Was this perhaps on a low-power laptop where max turbo is over twice the reference clock frequency? So after warm-up, you could be running 2 core cycles for every 1 that RDTSC was counting? – Peter Cordes Jul 20 '22 at 03:43
  • Yep, a laptop with a base clock of 2.7 GHz and a max turbo of 4.5. The rdtsc frequency is usually equal to the CPU base frequency, right? So I can just turn turbo off to get actual cycle count? Also, is there a nice library for proper microbenchmarking? I wrote a simple harness a few weeks ago that uses Linux's perf counters but it obviously doesn't work on macOS, which has comically confusing perf counting APIs.... – Ovinus Real Jul 20 '22 at 03:47
  • Yes, usually close on older CPUs (before Ice Lake). If you're on Linux, the kernel boot messages should include a TSC calibration measurement, e.g. 4008 MHz on my nominally 4.0 GHz i7-6700k. (`tsc: Refined TSC clocksource calibration: 4008.057 MHz`) Make sure to do some warm-up runs to get it up to speed (and do page faults) before testing. 4.5 / 2.7 is less than 2, but your 0.9c / pair is only explained if it's a bit *more* than 2. [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567) has a canonical answer with RDTSC details. – Peter Cordes Jul 20 '22 at 03:52
  • @PeterCordes Thanks for all the info! I threw together a more self-contained testing piece of code and ran it on a Kaby Lake laptop, but that has much more efficient clmul. Alas I don't have a Haswell Linux machine. – Ovinus Real Jul 20 '22 at 04:50
  • You can calibrate rdtsc yourself by running a loop that will take a known number of iterations (modulo interrupt overhead), like an empty loop with `dec`/`jnz`, or an `imul` dep chain. Or just time in wall-clock real time, and use the known CPU frequency (perhaps with turbo disabled if your laptop won't sustain max turbo for the benchmark duration). That's equivalent; without actually counting core cycles with the PMU, all you can do is make assumptions about the CPU frequency. RDTSC isn't special except for very very short intervals where its low overhead is useful. – Peter Cordes Jul 20 '22 at 05:18
  • 1.78c/pair seems plausible on SKL/KBL. Since `pclmul` runs on port 5 on that uarch, competing with `pshufd`, there's a port 5 bottleneck of 3 uops per loop iteration (2 pairs). IDK why it wouldn't run at 1.5c/pair; I'd run it under `perf stat` to check your TSC calibration, but maybe it's an L2 (?) cache bandwidth issue, although SKL L2 should be plenty fast enough for an average of 8 bytes loaded+stored per 1.5 cycles. Non-interleaved data (in 2 separate arrays) that didn't need a shuffle could run 1c / pair, I think. – Peter Cordes Jul 20 '22 at 05:27
  • 1
    Damn you're good; 1.5 cycles it is. Indeed rdtsc and the actual frequency are not the same even with turbo disabled; will investigate tmrw morning. – Ovinus Real Jul 20 '22 at 06:09
  • Cool, thanks for checking on real hardware. [Static performance analysis](https://stackoverflow.com/questions/51607391/what-considerations-go-into-predicting-latency-for-operations-on-modern-supersca) often works quite well for cases where branch prediction and cache/memory aren't a factor. (And where tricky front-end effects don't limit us to less than the pipeline width.) For tools to do this automatically, https://uops.info/uiCA.html is often good. (LLVM-MCA and IACA are sometimes also correct.) – Peter Cordes Jul 20 '22 at 13:49