I made some vectorized versions (Godbolt link, still with some big design-notes comments) and did some benchmarks when this question was new. I was going to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.
I don't have a Haswell I could test on, so I couldn't benchmark the pext version against this. I'm sure it's faster, though, since it's only 4 fast instructions.
*** Sandybridge (i5-2500k, so no hyperthreading)
*** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
TODO: update benchmarks for latest code changes
Total cycles, and instructions per clock, for the test loop. This measures only throughput, not latency, and a bottleneck on one execution port can make a function look worse in a microbenchmark than it will be when mixed with other code that keeps the other ports busy.

Lower numbers in the first column are better: they're total cycle counts in megacycles, and correspond to execution time, but with frequency scaling / turbo taken out of the mix. (We're not cache / memory bound at all, so the fact that a low core clock means fewer cycles wasted on a cache miss doesn't matter.)
     AVX                    no AVX
 887.519Mc 2.70Ipc      887.758Mc 2.70Ipc    use_orbits_shift_right
1140.68Mc  2.45Ipc     1140.47Mc  2.46Ipc    use_orbits_mul (old version that right-shifted after each)
 718.038Mc 2.79Ipc      716.452Mc 2.79Ipc    use_orbits_x86_lea
 767.836Mc 2.74Ipc     1027.96Mc  2.53Ipc    use_orbits_sse2_shift
 619.466Mc 2.90Ipc      816.698Mc 2.69Ipc    use_orbits_ssse3_shift
 845.988Mc 2.72Ipc      845.537Mc 2.72Ipc    use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
 583.239Mc 2.92Ipc      686.792Mc 2.91Ipc    use_orbits_ssse3_interleave_scalar
 547.386Mc 2.92Ipc      730.259Mc 2.88Ipc    use_orbits_ssse3_interleave
The fastest (for throughput in a loop) with AVX is orbits_ssse3_interleave.
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar, but orbits_x86_lea comes very close.

AVX helps a lot by providing non-destructive 3-operand vector instructions; that's maybe a bit less important on IvyBridge and later, where mov-elimination handles mov uops at register-rename time.
// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles / {c=$1; gsub(",", "", c); } /insns per cy/ {print c / 1000000 "Mc " $4"Ipc"}' *"$i"*;done | column -c 50 -x
// Include 0 and 1 for hosts with pext
// 5 is omitted because it's not written
Almost certainly the best version (with BMI2) is:
#include <stdint.h>
#include <immintrin.h>   // _pext_u64 requires BMI2 (compile with -mbmi2 or equivalent)

#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull

uint32_t orbits_1pext (uint64_t a) {
    // a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
    return _pext_u64( a | a<<1, HIBITS64);
}
This compiles to:
lea rax, [rdi+rdi]
or rdi, rax
movabs rax, -6148914691236517206
pext rax, rdi, rax
ret
So it's only 4 uops, and the critical-path latency is 5c = 3 (pext) + 1 (or) + 1 (lea) on Intel Haswell. The throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed, so throughput-wise we only need 3 fused-domain uops per result.
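For example, in a loop over an array (a hypothetical sketch of mine, not part of the benchmark harness above; needs BMI2, e.g. gcc -O3 -mbmi2), the compiler should keep the mask in a register across iterations:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define HIBITS64 0xaaaaaaaaaaaaaaaaull

// The movabs for HIBITS64 should be hoisted before the loop, leaving just
// lea / or / pext (plus load, store, and loop overhead) per element.
void orbits_1pext_array(uint32_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = _pext_u64(src[i] | (src[i] << 1), HIBITS64);
}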
The mov r, imm64 isn't ideal. (A 1-uop broadcast-immediate of a 32 or 8bit value to a 64bit reg would be ideal, but there's no such instruction.) Having the constant in data memory is an option, but inline in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that does pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 insns compared to the 4 enabled by the lea trick.
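For comparison, here's a minimal sketch of that two-mask variant (the orbits_2pext name is mine). Writing the second mask as ~HIBITS64 at least gives the compiler the option of deriving it with a not as described, though it may just use two 64-bit immediates:

#include <stdint.h>
#include <immintrin.h>

#define HIBITS64 0xaaaaaaaaaaaaaaaaull

// pext the high bit and the low bit of each pair separately, then OR:
// result bit i = a[2i] | a[2i+1], same as orbits_1pext, but one insn longer.
uint32_t orbits_2pext(uint64_t a) {
    return _pext_u64(a, HIBITS64) | _pext_u64(a, ~HIBITS64);
}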
The best version (with AVX) is:
#include <immintrin.h>
/* Yves Daoust's idea, operating on nibbles instead of bytes:
original:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
To operate on nibbles,
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
Q|= Q>>5 // Intertwine 8 nibbles into 8 bytes
// pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
// right-shift and OR to combine nibbles
// pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
// do some of this in GP regs if not doing two 64b elements in parallel.
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// but VEX-encoded SSE saves mov instructions
__m128i a = _mm_cvtsi64_si128(scalar_a);
// element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
__m128i lshift = _mm_slli_epi64(a, 1);
lshift = _mm_or_si128(lshift, a);
lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
// a      = bits:  h  g  f  e  d  c  b  a   (same thing in other bytes)
// lshift =       hg  0 fe  0 dc  0 ba  0
//        =        s  0  r  0  q  0  p  0
__m128i rshift = _mm_srli_epi64(lshift, 5); // again, element size doesn't matter, we're keeping only the low nibbles
// rshift byte = 0 q' 0 p' 0 s 0 r  (q', p' spill in from the next-higher byte; that high nibble gets masked off below)
__m128i nibbles = _mm_or_si128(rshift, lshift);
nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f) ); // have to zero the high nibbles: the sign bit affects pshufb
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
-1,-1,-1,-1, 6, 4, 2,0) );
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
// _mm_extract_epi32(ord_bytes, 2); // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}
The best version without AVX is a slight modification that only works with one input at a time, using SIMD only for the shuffling. In theory, using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2, where 64b pshufb is fast but 128b pshufb is not single-cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.
// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding
// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX. Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
uint64_t lshift = (scalar_a | scalar_a << 1);
lshift &= HIBITS64;
uint64_t rshift = lshift >> 5;
// per byte, rshift = 0 q' 0 p' 0 s 0 r  (q', p' spill in from the next-higher byte; that high nibble gets masked off next)
uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
// have to zero the high nibbles: the sign bit affects pshufb
__m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);
// nibbles = 0 0 0 0 q s p r
// pshufb -> 0 0 0 0 s r q p
const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8( // setr: first arg goes in the low byte, indexed by 0b0000
0b0000,
0b0100,
0b0001,
0b0101,
0b1000,
0b1100,
0b1001,
0b1101,
0b0010,
0b0110,
0b0011,
0b0111,
0b1010,
0b1110,
0b1011,
0b1111 );
__m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);
// want 00 00 00 00 AB CD EF GH from:
// ord_nibbles = 0A0B0C0D0E0F0G0H
// 0A0B0C0D0E0F0G0 H(shifted out)
__m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
// merged_nibbles= 0A AB BC CD DE EF FG GH. We want every other byte of this.
// 7 6 5 4 3 2 1 0
// pshufb is the most efficient way. Mask and then packuswb would work, but uses the shuffle port just like pshufb
__m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
return _mm_cvtsi128_si32(ord_bytes); // movd the low32 of the vector
}
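If you want to sanity-check the SSSE3 versions, a plain scalar reference like the following works (this is my addition, not part of the original benchmark code; it assumes the two functions above are in the same file, built with SSSE3 enabled, e.g. gcc -O3 -mssse3):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// Reference: OR each pair of input bits down to one output bit.
uint32_t orbits_ref(uint64_t a) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((a >> (2*i)) & 3)          // is either bit of the pair set?
            r |= (uint32_t)1 << i;
    return r;
}

// Assumes orbits_ssse3_interleave and orbits_ssse3_interleave_scalar
// (defined above) are in this translation unit.
int main(void) {
    uint64_t x = 0x1234567890abcdefULL;   // arbitrary starting pattern
    for (int i = 0; i < 1000; i++) {
        uint32_t ref = orbits_ref(x);
        if (orbits_ssse3_interleave(x) != ref ||
            orbits_ssse3_interleave_scalar(x) != ref) {
            printf("mismatch for %#llx\n", (unsigned long long)x);
            return EXIT_FAILURE;
        }
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG step
    }
    puts("all ok");
    return EXIT_SUCCESS;
}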
Sorry for the mostly code-dump answer. At this point I don't feel it's worth spending a huge amount of time discussing things beyond what the comments already do. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures, and the x86 wiki for other resources.