This works if your elements are 32 bits or wider.
I can't quite follow the documentation for _mm256_permutevar8x32_epi32,
but in practice, adding an offset to the identity permutation does a rotate (the instruction uses only the low 3 bits of each index, so indices past 7 wrap around), which is what you want once you already have the number of leading zeros.
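#include <immintrin.h>

// Rotate eight 32-bit elements: result element i is element (i + offset) mod 8 of w.
__m256i rotate_i32(__m256i w, int offset) {
    // _mm256_set_epi32 takes the highest element first, so element i holds i.
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}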
Godbolt: https://godbolt.org/z/Kv8oxs6oY
Output:
(-1, -2, -3, -4, -5, -6, -7, -8)
(-2, -3, -4, -5, -6, -7, -8, -1)
(-3, -4, -5, -6, -7, -8, -1, -2)
(-4, -5, -6, -7, -8, -1, -2, -3)
(-5, -6, -7, -8, -1, -2, -3, -4)
(-6, -7, -8, -1, -2, -3, -4, -5)
(-7, -8, -1, -2, -3, -4, -5, -6)
(-8, -1, -2, -3, -4, -5, -6, -7)
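To make the wrap-around explicit, here is a scalar sketch of what I believe the permute does, based on the documented behaviour that only the low 3 bits of each index are used. The names permutevar8x32_model and rotate_i32_model are made up for illustration and are not part of any library; dst must not alias src.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 8 x 32-bit permute: dst[i] = src[idx[i] & 7]. */
static void permutevar8x32_model(const int32_t src[8], const int32_t idx[8],
                                 int32_t dst[8]) {
    for (int i = 0; i < 8; ++i) {
        /* Only the low 3 bits of each index select a source lane. */
        uint32_t lane = (uint32_t)idx[i] & 7u;
        assert(lane < 8u);
        dst[i] = src[lane];
    }
}

/* Scalar counterpart of rotate_i32: identity permutation plus offset. */
static void rotate_i32_model(const int32_t src[8], int offset, int32_t dst[8]) {
    int32_t idx[8];
    for (int i = 0; i < 8; ++i) {
        /* Wrapping 32-bit add, mirroring _mm256_add_epi32. */
        idx[i] = (int32_t)((uint32_t)i + (uint32_t)offset);
    }
    permutevar8x32_model(src, idx, dst);
}
```

Calling rotate_i32_model on {-1, -2, -3, -4, -5, -6, -7, -8} with offset 1 gives {-2, -3, -4, -5, -6, -7, -8, -1}, the same as the second row of the output above.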
The same trick works for 64-bit elements, but you need to multiply the offset by 2, because each 64-bit element spans two 32-bit lanes.
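// Rotate four 64-bit elements: result element i is element (i + offset) mod 4 of w.
__m256i rotate_i64(__m256i w, int offset) {
    // Still a 32-bit permute; each 64-bit element is a pair of adjacent 32-bit lanes.
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    // offset * 2 moves both halves of every 64-bit element together.
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset * 2));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}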
Godbolt: https://godbolt.org/z/85h6aWPsW
Output:
(-1, -2, -3, -4)
(-2, -3, -4, -1)
(-3, -4, -1, -2)
(-4, -1, -2, -3)
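The 64-bit case can be modelled in scalar code the same way. This is only a sketch under the assumption that the permute uses the low 3 bits of each 32-bit index; permutevar8x32_model and rotate_i64_model are hypothetical names, not library functions. It treats the four 64-bit elements as eight 32-bit lanes and shifts the indices by 2 * offset, so both halves of each element travel together.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 8 x 32-bit permute: dst[i] = src[idx[i] & 7]. */
static void permutevar8x32_model(const uint32_t src[8], const uint32_t idx[8],
                                 uint32_t dst[8]) {
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[idx[i] & 7u];
    }
}

/* Scalar counterpart of rotate_i64, built on the 32-bit permute model. */
static void rotate_i64_model(const int64_t src[4], int offset, int64_t dst[4]) {
    uint32_t lanes[8], idx[8], out[8];
    /* Four 64-bit elements occupy exactly eight 32-bit lanes. */
    assert(sizeof lanes == 4 * sizeof src[0]);
    memcpy(lanes, src, sizeof lanes);
    for (int i = 0; i < 8; ++i) {
        /* Shift by two lanes per 64-bit element. */
        idx[i] = (uint32_t)i + 2u * (uint32_t)offset;
    }
    permutevar8x32_model(lanes, idx, out);
    memcpy(dst, out, sizeof out);
}
```

With {-1, -2, -3, -4} and offset 1 this yields {-2, -3, -4, -1}, matching the second row of the output above. Because lanes 2k and 2k+1 always move as a pair and keep their order, the result does not depend on byte order.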