53

What I want to do is take a 64-bit unsigned integer consisting of pairs of bits and create from it a 32-bit integer containing 0 if both bits in the corresponding pair are 0, and 1 otherwise. In other words, convert something that looks like:

01 00 10 11

into something that looks like this

1 0 1 1

The two obvious solutions are either a brute-force loop, or a lookup table for each byte followed by eight lookups combined into a final result with OR and bit shifting, but I'm sure there should be an efficient means of bit-twiddling this. I will be doing this for 64-bit integers in C++, but if anyone knows of efficient ways to do this for shorter integers I'm sure I can figure out how to scale it up.
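
For reference, the brute-force loop I have in mind is something along these lines (just a baseline sketch, nothing clever):

#include <stdint.h>

uint32_t or_pairs_naive(uint64_t n)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; ++i) {
        if (n & (3ull << (2 * i)))   // is either bit of pair i set?
            result |= 1u << i;
    }
    return result;
}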

Jack Aidley
    Pretty obvious: if you can spare the whopping 65 kilobytes of memory, you can do it in 4, not 8 lookups, by doing lookup on 16-bit parts. – Bartek Banachewicz Dec 08 '15 at 11:32
    @BartekBanachewicz Assuming that reading a LUT is actually faster than brute-forcing. – Rotem Dec 08 '15 at 11:33
  • @Rotem the "brute-forcing" solution would do exactly the same, just with a 2-bit table. – Bartek Banachewicz Dec 08 '15 at 11:34
    I rather doubt a 64K lookup table would be faster than a 256-byte lookup table because of the much slower caching. I doubt even a 256-byte lookup table would be particularly fast. – Jack Aidley Dec 08 '15 at 11:36
  • What is the context of this question? Why do you have this data and what do you want to do with it? – Kijewski Dec 08 '15 at 11:41
    If your CPU supports `pext` then use selectors 010101... and 101010... , OR the results – M.M Dec 08 '15 at 12:05
  • If you benchmark these solutions, please let us know the outcome. –  Dec 08 '15 at 15:29
  • According to my benchmarks, Yves's optimised lookup and Blastfurnace's bit shifting solutions take almost identical amounts of time (with Yves's LUT coming out narrowly ahead), while the pext versions of Nils and Harold are both a tad under 10% faster than these solutions but take essentially equal amounts of time to run. – Jack Aidley Dec 08 '15 at 15:53
  • Ah, actually most of that difference is down to it inlining the pext version and not the others. Accounting for that eliminates most of the difference, so the pext version is only around 2-3% faster than the shift version. A naive loop implementation takes about 200x longer than any of these implementations. – Jack Aidley Dec 08 '15 at 16:13
  • @JackAidley: The size of the lookup table doesn't matter, it's the number of cache lines from that table that are read. So it depends on the distribution of data. – gnasher729 Dec 08 '15 at 22:04
  • @JackAidley: Can you post your benchmark test framework? I want to compare some implementations myself. (and some x86 SSE variations on these ideas). – Peter Cordes Dec 09 '15 at 02:53
  • I made some [vectorized versions](http://goo.gl/k3JMYW). I need to re-test after making some tweaks, and try with clang for the multiply version, but so far @Yves's interleave idea (with a 4bit LUT implemented with x86 SSSE3 `pshufb`) is the fastest (throughput doing only that) on Intel Sandybridge, with or without AVX. And it will be exactly the same speed for doing two values in parallel (or 4 with AVX2). I'll post an answer with findings in more detail. I don't have a Haswell to test on, so I can't compare `pext`, but I'm sure it'll be the fastest. Haswell only has half SnB shuffle tput. – Peter Cordes Dec 10 '15 at 07:39
  • @PeterCordes: nice job, thanks. –  Dec 10 '15 at 08:09

7 Answers

54

Here is a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.

uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);

    // deinterleave
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;

    return x;
}

gcc, clang, and msvc all compile this down to about 30 instructions.

From the comments, there is a modification that can be made.

  • Change the first line to use a single bitmask operation to select only the "odd" bits.

The possibly (?) improved code is:

uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits

    // deinterleave
    x = (x | (x >> 1)) & 0x3333333333333333ull; // bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full; // nibbles
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull; // octets
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull; // halfwords
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words

    return x;
}
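
A quick sanity check against the example from the question might look like this (a test snippet of my own, assuming the calc above is in the same file):

#include <assert.h>
#include <stdint.h>

uint64_t calc(uint64_t n); // the function above

int main()
{
    assert(calc(0x4B) == 0xB);            // 01 00 10 11 -> 1 0 1 1
    assert(calc(~0ull) == 0xFFFFFFFFull); // every pair non-zero -> 32 ones
    return 0;
}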
Blastfurnace
    30 instructions in 64-bit code, I suppose? How about a LUT? Especially for 32-bit code? – Alexey Frunze Dec 08 '15 at 13:13
  • @AlexeyFrunze at this point you could just post a decompiled LUT solution as another answer, I suppose – Bartek Banachewicz Dec 08 '15 at 13:15
  • @AlexeyFrunze: Yes, x64 is 29 instructions and x86 is 54 instructions (MSVC 2015). – Blastfurnace Dec 08 '15 at 13:23
  • Interesting, but I'm really not comfortable upvoting magic like this without a live demo. – sehe Dec 08 '15 at 13:38
    @sehe: [Here is the code on ideone.com.](http://ideone.com/DLz4XL) You can fork it and provide different input values. I hope it's correct... – Blastfurnace Dec 08 '15 at 13:53
  • @sehe the linked answer explains how it works rather clearly – Bartek Banachewicz Dec 08 '15 at 14:05
    Nice. You can remove at least one of the original bit masks as well, since it doesn't matter what happens on the bits you don't collect at the end. – Jack Aidley Dec 08 '15 at 14:33
  • @JackAidley: The bitmask on the last line can also be removed. – Blastfurnace Dec 08 '15 at 14:37
  • While the pext solution is narrowly faster, I think this is the cleanest solution because of its higher portability. – Jack Aidley Dec 08 '15 at 16:11
    Since we're de-interleaving, we don't need to mask with 0x555... and 0xAAA... Just do `(n | (n>>1))` and the de-interleaving will get rid of the unwanted bits. That should shave at least two instructions. – slebetman Dec 09 '15 at 09:12
    You can NOT skip the initial step of masking the inputs (although you can reduce it to one step of masking the resulting x). Error Case: Binary: `11001001` expected output: `1011` output with "optimized" version: `1111` – Falco Dec 09 '15 at 10:35
  • @Falco: Thank you. I think the mask should be all `0101` though since this uses the odd bits. – Blastfurnace Dec 09 '15 at 10:48
  • You still need the mask after the last operation, to get rid of the high bits. Try it with a number that has any of the bits 32-63 set. You can avoid the mask if you change the return type to `uint32_t`. – interjay Dec 10 '15 at 17:34
  • @interjay: Thank you. I thought I tested that but I obviously missed it. Fixed. – Blastfurnace Dec 10 '15 at 21:04
44

Probably the fastest solution for the x86 architecture with the BMI2 instruction set:

#include <stdint.h>
#include <x86intrin.h>

uint32_t calc (uint64_t a)
{
   return _pext_u64(a, 0x5555555555555555ull) |
          _pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}

This compiles to 5 instructions total.

Nils Pipenbrinck
    next question: what's the optimal implementation of `pext` for CPUs that don't have it? – M.M Dec 08 '15 at 12:07
  • Very nice. But I'd still like a decent implementation for non-x86 CPUs. – Jack Aidley Dec 08 '15 at 12:59
    @M.M: that probably *isn't* really the next question though, because for CPUs that don't have `pext` there are probably better options than emulating it. We only really need one special case. It's up around the questions worth asking, sure :-) – Steve Jessop Dec 08 '15 at 14:39
  • And BMI2 is pretty recent, and currently unsupported by AMD so it can't really be relied upon to be there. – Jack Aidley Dec 08 '15 at 15:00
    @Nils: Probably faster: `_pext_u64( a | a>>1, 0b01010101...)`. Better throughput, same latency, on Intel Haswell. (`pext/pext/OR` is 5 cycle latency, because pext is 3 cycle latency and can only run on port1. So both results won't be ready until the 4th cycle after `a` is ready, at which point `OR` takes a cycle.) Throughput is one per 2 cycles, bottlenecked on port1 for `pext`. `shr/or/pext` is also 5c: shr(1)->or(1)->pext(3) dep chain, but shr can run on port0/6, `or` can run on p0/1/5/6. So throughput = one per cycle, bottlenecked on port1 for pext. – Peter Cordes Dec 08 '15 at 16:55
  • @Jack: Microbenchmarks are hard. Measuring something in a tight loop measures the throughput of that one sequence, which isn't always the same as the impact it has when mixed in with other instructions. Your comments on the question suggest that you found pext only slightly faster than a LUT or shift/or. That's insane, there's definitely something funny going on. Maybe your `pext` control-mask isn't getting cached in a register? You should be getting much higher throughput from `pext`, unless you're just bottlenecking on main memory with a big buffer. – Peter Cordes Dec 08 '15 at 16:56
    Update: `_pext_u64( a | a<<1, 0b101010...)` compiles to only 4 instructions on x86: It saves a `mov` because it can use `lea rax, [rdi+rdi]` as a non-destructive left-shift. See my answer. But yes, including a BMI2 version of this is only useful if you're going to do runtime CPU detection and call this (or a function that inlines it) through a function pointer (or other method of CPU dispatching, like dynamic library tricks). – Peter Cordes Feb 02 '16 at 15:44
14

If you do not have pext and you still want to do better than the trivial way, then this extraction can be expressed as a logarithmic number of bit moves (logarithmic in the word length, if you generalize it):

// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1);   // make pairs
x = bitmove(x, rep8(0x30), 2);   // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler

With:

uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step) {
    // move the masked bits right by step, clearing their old positions
    // (a delta swap into zeroed destination bits)
    return (x & ~mask) | ((x & mask) >> step);
}

repk is just so I could write shorter constants. rep8(0x44) = 0x4444444444444444 etc.
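
Spelling out the repk constants and adding types, a compilable sketch of the above (the wrapper name or_adjacent is my own; bitmove clears the bits it moves, matching the delta-swap intent) would be:

#include <stdint.h>

static uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step) {
    return (x & ~mask) | ((x & mask) >> step); // move masked bits right, clear old spots
}

uint32_t or_adjacent(uint64_t x) {
    x = (x | (x >> 1)) & 0x5555555555555555ull; // OR adjacent bits
    x = bitmove(x, 0x4444444444444444ull, 1);   // make pairs
    x = bitmove(x, 0x3030303030303030ull, 2);   // make nibbles
    x = bitmove(x, 0x0F000F000F000F00ull, 4);   // make bytes
    x = bitmove(x, 0x00FF000000FF0000ull, 8);   // make words
    return (uint32_t)(x | (x >> 16));           // final step is simpler
}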

Also if you do have pext, you can do it with only one of them, which is probably faster and at least shorter:

_pext_u64(x | (x >> 1), rep8(0x55));
harold
  • This is pretty cool. So you're essentially removing the gaps made by the `OR` with those bit moves? – Bartek Banachewicz Dec 08 '15 at 13:09
  • @BartekBanachewicz yes perhaps it's more obvious in Blastfurnace's answer, comes down to the same thing somehow – harold Dec 08 '15 at 13:09
  • I didn't see this answer before [I commented on Nils'](http://stackoverflow.com/questions/34154745/efficient-way-to-or-adjacent-bits-in-64-bit-integer#comment56069863_34155404): `pext(x| x>>1, ...)` is still 5 cycle latency, but one per one clock throughput instead of one per two clock. (Intel Haswell: bottleneck on port1 for pext). – Peter Cordes Dec 08 '15 at 17:48
10

Okay, let's make this more hacky then (might be buggy):

uint64_t x;

uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits  = x & 0x5555555555555555ull;

Now, my original solution did this:

// wrong
even_bits >>= 1;
uint64_t solution = even_bits | odd_bits;

However, as JackAidley pointed out, while this aligns the bits together, it doesn't remove the spaces from the middle!

Thankfully, we can use a very helpful _pext instruction from the BMI2 instruction set.

u64 _pext_u64(u64 a, u64 m) - Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.

solution = _pext_u64(solution, 0x5555555555555555ull); // gather the even-position bits

Alternatively, instead of using & and >> to separate out the bits, you might just use _pext twice on the original number with the provided masks (which would split it up into two contiguous 32-bit numbers), and then simply or the results.

If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler one than your original idea, perhaps.
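
Concretely, that two-pext alternative would be something like this sketch (it needs BMI2, e.g. compile with -mbmi2, and is essentially the same as the dedicated BMI2 answer on this page):

#include <stdint.h>
#include <immintrin.h>

uint32_t or_pairs_pext(uint64_t x)
{
    // extract the even-position and odd-position bits into two contiguous
    // 32-bit groups, then OR them together
    return (uint32_t)(_pext_u64(x, 0x5555555555555555ull) |
                      _pext_u64(x, 0xaaaaaaaaaaaaaaaaull));
}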

Bartek Banachewicz
  • This leaves gaps between the values so that rather than 01001011 going to 1011 it goes to 1001010 – Jack Aidley Dec 08 '15 at 11:44
  • Those are 32-bit masks – M.M Dec 08 '15 at 11:44
    @M.M. extension to 64-bits is trivial – Bartek Banachewicz Dec 08 '15 at 11:44
    @Jack. Ah, I see what you mean (had to think for a while). I wonder whether this can be fixed without a loop though. – Bartek Banachewicz Dec 08 '15 at 11:47
  • I think there's an SSE instruction which gets the sign bit of each byte. – Rotem Dec 08 '15 at 11:48
  • @Rotem There's a `_pext` instruction called "bit gather" that does that for *arbitrary* bits, but I'm looking for something specifically for odd/even. – Bartek Banachewicz Dec 08 '15 at 11:51
  • @JackAidley Added a way to remove that using SSE; Rotem pointed me in the right direction. I'm sorry that I wasn't able to create a super-universal solution; I just found the problem interesting enough to give it a try. – Bartek Banachewicz Dec 08 '15 at 11:57
    @BartekBanachewicz: `pext` is part of BMI2, not SSE or AVX1/2. Inside a virtual machine, you could easily have a virtual CPU that didn't have AVX, but did have BMI2, even though the only physical CPUs that currently have BMI2 also have AVX2 (Intel Haswell and onwards, and apparently AMD Excavator will have it too). AMD Piledriver and Steamroller support BMI1, but not BMI2. – Peter Cordes Dec 08 '15 at 17:04
7

Slight improvement over the LUT approach (4 lookups instead of 8):

Compute the bitwise OR and clear every other bit. Then intertwine the bits of pairs of bytes to yield four bytes. Finally, reorder the bits in those four bytes (mapped onto the quadword) by means of a 256-entry lookup table:

Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
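
A compilable sketch of this idea (the table construction and the byte extraction are my own filling-in of the pseudocode; call init_lut() once before use):

#include <stdint.h>

static uint8_t LUT[256]; // LUT[b] puts the interleaved result bits of byte b back in order

static void init_lut(void)
{
    for (int b = 0; b < 256; ++b) {
        uint8_t out = 0;
        for (int k = 0; k < 4; ++k) {
            out |= (uint8_t)(((b >> (2 * k + 1)) & 1) << k);   // pairs 0..3 sit at odd positions
            out |= (uint8_t)(((b >> (2 * k)) & 1) << (k + 4)); // pairs 4..7 sit at even positions
        }
        LUT[b] = out;
    }
}

uint32_t or_pairs_lut(uint64_t Q)
{
    Q = (Q | (Q << 1)) & 0xAAAAAAAAAAAAAAAAull; // OR each pair into its high bit
    Q |= Q >> 9;                                // bytes 0,2,4,6 each now hold 8 result bits
    return (uint32_t)LUT[(uint8_t)Q]
         | (uint32_t)LUT[(uint8_t)(Q >> 16)] << 8
         | (uint32_t)LUT[(uint8_t)(Q >> 32)] << 16
         | (uint32_t)LUT[(uint8_t)(Q >> 48)] << 24;
}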
  • If you can live with the final bits being scrambled, `Q|= Q >> 33` is enough. –  Dec 08 '15 at 13:43
  • The bits need to be in the same order – Jack Aidley Dec 08 '15 at 14:26
  • @JackAidley Then two shifts, two bitwise-or, one bitwise-and with a constant (all on 64 bits), and four byte lookups to unscramble. –  Dec 08 '15 at 14:31
  • Vector shuffles (like x86 `pshufb`) can be used to implement a 4bit LUT of bytes. So you could apply this with a `Q |= Q >> 5`. Since we only want the low nibbles of each byte, that's just a simple vector AND. – Peter Cordes Dec 08 '15 at 17:19
6

The hard part seems to be packing the bits after the ORing. The ORing itself is done by:

ored  = (x | (x>>1)) & 0x5555555555555555;

(assuming large enough ints so we don't have to use suffixes). Then we can pack them in steps, each step building on the previous one: first two-by-two, then four-by-four, etc.:

pack2 = ((ored*3) >> 1) & 0x3333333333333333;
pack4 = ((pack2*5) >> 2) & 0x0F0F0F0F0F0F0F0F;
pack8 = ((pack4*17) >> 4) & 0x00FF00FF00FF00FF;
pack16 = ((pack8*257) >> 8) & 0x0000FFFF0000FFFF;
pack32 = ((pack16*65537) >> 16) & 0xFFFFFFFF;
// (or cast to uint32_t instead of the final & 0xFFFFFFFF)

What happens in the packing is that the multiplication combines the data with a shifted copy of itself. In your example, the first multiplication would be (I denote zeros that come from the masking in ored as o; the other 0s come from the original data):

 o1o0o1o1
     x 11
----------
 o1o0o1o1
o1o0o1o1
----------
o11001111
  ^^  ^^ 
 o10oo11o < these are bits we want to keep.
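
Putting the multiply steps together with types (the wrapper function is mine), the whole routine reads:

#include <stdint.h>

uint32_t or_pairs_mul(uint64_t x)
{
    uint64_t ored   = (x | (x >> 1)) & 0x5555555555555555ull;
    uint64_t pack2  = ((ored  * 3)   >> 1) & 0x3333333333333333ull;
    uint64_t pack4  = ((pack2 * 5)   >> 2) & 0x0F0F0F0F0F0F0F0Full;
    uint64_t pack8  = ((pack4 * 17)  >> 4) & 0x00FF00FF00FF00FFull;
    uint64_t pack16 = ((pack8 * 257) >> 8) & 0x0000FFFF0000FFFFull;
    return (uint32_t)((pack16 * 65537) >> 16); // the cast replaces the final mask
}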

We could have done that by ORing as well:

ored = (ored | (ored>>1)) & 0x3333333333333333;
ored = (ored | (ored>>2)) & 0x0F0F0F0F0F0F0F0F;
ored = (ored | (ored>>4)) & 0x00FF00FF00FF00FF;
ored = (ored | (ored>>8)) & 0x0000FFFF0000FFFF;
ored = (ored | (ored>>16)) & 0xFFFFFFFF;

// ored = ((uint32_t)ored | (uint32_t)(ored>>16));  // helps some compilers make better code, esp. on x86
skyking
  • This mul method is actually pretty solid for x86. `imul r64, r64, imm32` is 3 cycle latency, one per 1 cycle throughput on Sandybridge. This competes well with the copy/shift/or it does for the OR version. – Peter Cordes Dec 09 '15 at 19:15
  • I think your initial `OR` should be `ored = (x | x>>1) & 0x55...ULL`, though. `x| x<<1` will put the valuable bits in the high bit of each pair (consider what happens at the left or right end, where a zero is shifted in). Note that the final masking down to 32bit can be done with a cast to `uint32_t`. (gcc isn't clever enough to see that it can leave garbage in the upper32 of a register if the return type is 32bit, or use a 32bit `or` after the final shift for the OR method.) – Peter Cordes Dec 09 '15 at 19:18
  • Oops, actually gcc does leave out the final mask, and casting just helps it do a 32bit `or` instead of a 64bit `or`, saving one byte. However, you can save a lot of `shift` instructions by accumulating the contiguous bits in the upper half, instead of always right-shifting after each multiply. clang actually makes a really short function, using `imul`. (but gcc breaks the multiplies into `mov/shl/add` when they're too big for `LEA`, so gcc generates similar code for multiply or left-shift.) In a loop, both ways auto-vectorize. http://goo.gl/k3JMYW for benchmarks and some vectorized versions too. – Peter Cordes Dec 10 '15 at 07:34
2

I made some vectorized versions (godbolt link still with some big design-notes comments) and did some benchmarks when this question was new. I was going to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.

I don't have a Haswell I could test on, so I couldn't benchmark the pextr version against this. I'm sure it's faster, though, since it's only 4 fast instructions.

 *** Sandybridge (i5-2500k, so no hyperthreading)
 *** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
 TODO: update benchmarks for latest code changes

   total cycles, and insn/clock, for the test-loop
   This measures only throughput, not latency,
   and a bottleneck on one execution port might make a function look worse in a microbench
   than it will do when mixed with other code that can keep the other ports busy.

Lower numbers in the first column are better: 
these are total cycle counts in Megacycles, and correspond to execution time 
but they take frequency scaling / turbo out of the mix.
(We're not cache / memory bound at all, so low core clock = fewer cycles for cache miss doesn't matter).

     AVX                  no AVX
887.519Mc  2.70Ipc      887.758Mc  2.70Ipc    use_orbits_shift_right
1140.68Mc  2.45Ipc      1140.47Mc  2.46Ipc    use_orbits_mul  (old version that right-shifted after each)
718.038Mc  2.79Ipc      716.452Mc  2.79Ipc    use_orbits_x86_lea
767.836Mc  2.74Ipc      1027.96Mc  2.53Ipc    use_orbits_sse2_shift
619.466Mc  2.90Ipc      816.698Mc  2.69Ipc    use_orbits_ssse3_shift
845.988Mc  2.72Ipc      845.537Mc  2.72Ipc    use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
583.239Mc  2.92Ipc      686.792Mc  2.91Ipc    use_orbits_ssse3_interleave_scalar
547.386Mc  2.92Ipc      730.259Mc  2.88Ipc    use_orbits_ssse3_interleave

The fastest (for throughput in a loop) with    AVX is orbits_ssse3_interleave
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar
but orbits_x86_lea comes very close.

AVX for non-destructive 3-operand vector insns helps a lot
Maybe a bit less important on IvB and later, where mov-elimination handles mov uops at register-rename time

// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles   / {c=$1; gsub(",", "", c); }  /insns per cy/ {print c / 1000000 "Mc  " $4"Ipc"}' *"$i"*;done | column -c 50 -x
//  Include 0 and 1 for hosts with pextr
// 5 is omitted because it's not written

Almost certainly the best version (with BMI2) is:

#include <stdint.h>
#include <immintrin.h>   // for _pext_u64 (BMI2)
#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull

uint32_t orbits_1pext (uint64_t a) {
    // a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
    return _pext_u64( a | a<<1, HIBITS64);
}

This compiles to:

    lea     rax, [rdi+rdi]
    or      rdi, rax
    movabs  rax, -6148914691236517206
    pext    rax, rdi, rax
    ret

So it's only 4 uops, and the critical path latency is 5c = 3(pext) + 1(or) + 1(lea). (Intel Haswell). The throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed. This means throughput-wise that we only need 3 fused-domain uops per result.

The mov r, imm64 isn't ideal. (A 1uop broadcast-immediate 32 or 8bit to a 64bit reg would be ideal, but there's no such instruction). Having the constant in data memory is an option, but inline in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that does pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 insns compared to the 4 enabled by the lea trick.


The best version (with AVX) is:

#include <immintrin.h>

/* Yves Daoust's idea, operating on nibbles instead of bytes:
   original:
   Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
   Q|= Q >> 9; // Intertwine 4 words into 4 bytes
   B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes

   To operate on nibbles,
   Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
   Q|= Q>>5  // Intertwine 8 nibbles into 8 bytes
   // pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
   // right-shift and OR to combine nibbles
   // pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
    // do some of this in GP regs if not doing two 64b elements in parallel.
    // esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
    // but VEX-encoded SSE saves mov instructions

    __m128i a = _mm_cvtsi64_si128(scalar_a);
    // element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
    __m128i lshift = _mm_slli_epi64(a, 1);
    lshift = _mm_or_si128(lshift, a);
    lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
    // a = bits:   h  g  f  e  d  c  b  a  (same thing in other bytes)
    // lshift =    hg 0 fe  0  dc 0  ba 0
    // lshift =    s  0  r  0  q  0  p  0

    // lshift =    s 0 r 0 q 0 p 0
    __m128i rshift = _mm_srli_epi64(lshift, 5);  // again, element size doesn't matter, we're keeping only the low nibbles
    // rshift =              s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    __m128i nibbles = _mm_or_si128(rshift, lshift);
    nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f) );  // have to zero the high nibbles: the sign bit affects pshufb

    // nibbles =   0 0 0 0 q s p r
    // pshufb ->   0 0 0 0 s r q p
    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
    0b0000,
    0b0100,
    0b0001,
    0b0101,
    0b1000,
    0b1100,
    0b1001,
    0b1101,
    0b0010,
    0b0110,
    0b0011,
    0b0111,
    0b1010,
    0b1110,
    0b1011,
    0b1111 );
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want            00 00 00 00 AB CD EF GH from:

    // ord_nibbles   = 0A0B0C0D0E0F0G0H
    //                  0A0B0C0D0E0F0G0 H(shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles= 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                 7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
                                      -1,-1,-1,-1,  6, 4, 2,0) );
    return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
    // _mm_extract_epi32(ord_bytes, 2); // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}

The best version without AVX is a slight modification that only works with one input at a time, only using SIMD for the shuffling. In theory using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2 where 64b pshufb is fast, but 128b pshufb is not single cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.

// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports

// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding

// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX.  Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
    uint64_t lshift = (scalar_a | scalar_a << 1);
    lshift &= HIBITS64;

    uint64_t rshift = lshift >> 5;
    // rshift =              s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
    // have to zero the high nibbles: the sign bit affects pshufb
    __m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);

    // nibbles =   0 0 0 0 q s p r
    // pshufb ->   0 0 0 0 s r q p

    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
    0b0000,
    0b0100,
    0b0001,
    0b0101,
    0b1000,
    0b1100,
    0b1001,
    0b1101,
    0b0010,
    0b0110,
    0b0011,
    0b0111,
    0b1010,
    0b1110,
    0b1011,
    0b1111 );
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want            00 00 00 00 AB CD EF GH from:

    // ord_nibbles   = 0A0B0C0D0E0F0G0H
    //                  0A0B0C0D0E0F0G0 H(shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles= 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                 7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
    return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
}

Sorry for the mostly code-dump answer. At this point I didn't feel it was worth spending a huge amount of time discussing things beyond what the comments already cover. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures, and the x86 tag wiki for other resources.

Peter Cordes