For those who need to shift by exactly 64 bits, you can use a permute instruction, which works directly in registers. For a shift by a multiple of 8 bits, you could use the byte shuffle (see VPSHUFB, and look at the cast functions if you are dealing with floats, since the shuffles work on integers).
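As a minimal sketch of that cast idiom (assuming AVX-512BW for the 512-bit VPSHUFB, and reusing the v array from the full example below; the variable names are just for illustration). Note that VPSHUFB shuffles bytes within each 128-bit lane, so this rotates each lane by one byte rather than the whole register:
__m512d vd(_mm512_loadu_pd(v));                    // doubles from memory
__m512i vi(_mm512_castpd_si512(vd));               // reinterpret as integers, no instruction emitted
__m128i lane_ctrl(_mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                                9, 10, 11, 12, 13, 14, 15, 0));
__m512i ctrl(_mm512_broadcast_i32x4(lane_ctrl));   // same byte pattern in all four 128-bit lanes
__m512i si(_mm512_shuffle_epi8(vi, ctrl));         // VPSHUFB, byte shuffle per 128-bit lane
__m512d out(_mm512_castsi512_pd(si));              // back to doubles for _mm512_storeu_pd()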
Here is an example that shifts the whole register right by 64 bits ("SHR zmm1, 64"). The mask is used to clear the top 64 bits. If you want ROR-like functionality, use the version without the mask. A shift to the left is also possible; just change the indexes as required (there is a sketch of that after the disassembly below).
#include <immintrin.h>
#include <cstdint>
#include <iostream>
void show(char const * msg, double *v)
{
    std::cout << msg << ":";
    for(int j(0); j < 8; ++j)
    {
        std::cout << " " << v[j];
    }
    std::cout << "\n";
}
int main(int argc, char * argv[])
{
double v[8] = { 1., 2., 3., 4., 5., 6., 7., 8. };
double q[8] = {};
alignas(64) std::uint64_t indexes[8] = { 1, 2, 3, 4, 5, 6, 7, 0 }; // result[j] = source[indexes[j]]
show("init", v);
show("q", q);
// load
__m512d a(_mm512_loadu_pd(v));
__m512i i(_mm512_load_epi64(indexes));
// shift
//__m512d b(_mm512_permutex_pd(a, 0x39)); // can't cross between 4 low and 4 high with immediate
//__m512d b(_mm512_permutexvar_pd(i, a)); // ROR
__m512d b(_mm512_maskz_permutexvar_pd(0x7F, i, a)); // LSR on a double basis
// store
_mm512_storeu_pd(q, b);
show("shifted", q);
show("original", v);
}
With full optimization (-O3), the whole shift is reduced to 3 instructions (which are intermingled with others in the output):
96a: 62 f1 fd 48 6f 85 10 vmovdqa64 -0xf0(%rbp),%zmm0
971: ff ff ff
974: b8 7f 00 00 00 mov $0x7f,%eax # mask
979: 48 8d 3d 10 04 00 00 lea 0x410(%rip),%rdi # d90 <_IO_stdin_used+0x10>
980: c5 f9 92 c8 kmovb %eax,%k1 # special k1 register
984: 4c 89 e6 mov %r12,%rsi
987: 62 f2 fd c9 16 85 d0 vpermpd -0x130(%rbp),%zmm0,%zmm0{%k1}{z} # "shift"
98e: fe ff ff
991: 62 f1 fd 48 11 45 fe vmovupd %zmm0,-0x80(%rbp)
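As mentioned above, the shift to the left only needs different indexes and a different mask (clear the bottom element instead of the top one). A sketch against the same a as in the example; the names left_indexes, il and bl are just for illustration:
// "SHL zmm1, 64": element j takes source element j-1, element 0 is zeroed
alignas(64) std::uint64_t left_indexes[8] = { 0, 0, 1, 2, 3, 4, 5, 6 }; // index for element 0 is a don't-care, it gets masked out
__m512i il(_mm512_load_epi64(left_indexes));
__m512d bl(_mm512_maskz_permutexvar_pd(0xFE, il, a)); // 0xFE clears the bottom 64 bits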
In my case, I want to use that in a loop. The load (vmovdqa64) and store (vmovupd) are going to be before and after the loop, so inside the loop it will be really fast. (It needs to rotate that way 4,400 times before I need to save the result.)
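In other words, the structure is roughly the following (a sketch only: acc is an illustrative name, v, q and i are the variables from the example above, the per-iteration work is elided, and the unmasked ROR variant is used here):
__m512d acc(_mm512_loadu_pd(v));             // one load before the loop
for(int n(0); n < 4400; ++n)
{
    acc = _mm512_permutexvar_pd(i, acc);     // the rotate, stays in registers
    // ... per-iteration work on acc goes here ...
}
_mm512_storeu_pd(q, acc);                    // one store after the loop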
As pointed out by Peter, we can also use the valignq instruction:
// this goes in place of the permute, without the need for the indexes;
// with mask 0xFF the masking is a no-op (ROR behaviour), use 0x7F to zero the top element (LSR)
__m512i b(_mm512_maskz_alignr_epi64(0xFF, _mm512_castpd_si512(a), _mm512_castpd_si512(a), 1));
// for the store, cast back to doubles: _mm512_storeu_pd(q, _mm512_castsi512_pd(b));
and the result is a single instruction, like so:
979: 62 f1 fd 48 6f 85 d0 vmovdqa64 -0x130(%rbp),%zmm0
980: fe ff ff
983: 48 8d 75 80 lea -0x80(%rbp),%rsi
987: 48 8d 3d 02 04 00 00 lea 0x402(%rip),%rdi # d90 <_IO_stdin_used+0x10>
98e: 62 f3 fd 48 03 c0 01 valignq $0x1,%zmm0,%zmm0,%zmm0
995: 62 f1 fd 48 11 45 fd vmovupd %zmm0,-0xc0(%rbp)
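If you want the zero-filling shift (rather than the rotate) with valignq too, one option is to keep the 0x7F mask as in the permute version; another is to feed a zero vector as the high half of the concatenation. A sketch, not what the disassembly above shows (z and s are just illustrative names):
// {zero : a} shifted right by one 64-bit element gives { a1..a7, 0 }
__m512i z(_mm512_setzero_si512());
__m512i s(_mm512_alignr_epi64(z, _mm512_castpd_si512(a), 1));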
An important point: using fewer registers is also much better, since it increases our chances of keeping the whole operation 100% in registers instead of having to go through memory (512 bits is a lot to transfer to and from memory).