Is there a (fast) way to reverse the bits of each 32-bit int value within an AVX2 register? E.g.

_mm256_set1_epi32(2732370386);
<do something here>
//binary: 10100010110111001010100111010010 => 01001011100101010011101101000101
//register contains 1268071237, which is the decimal representation of 01001011100101010011101101000101
Paul R
John Smith
  • You want to reverse the bits of a single int32 in an AVX2 integer register, or you want to reverse the bits of each of the 8 such integers? – John Zwinck Sep 20 '17 at 09:00
  • @JohnZwinck, It doesn't really matter: once I get the idea how to do that, I can shuffle the 32bit values within the register any way I like. – John Smith Sep 20 '17 at 09:02
  • The old way (reverse bytes, reverse groups of 4 bits with `pshufb`, OR the results) generalizes to AVX2; I can't immediately find the dupe though – harold Sep 20 '17 at 09:04
  • You can just use any of the standard bit twiddling hacks to reverse bytes and then shuffle the bytes (see e.g. [Hacker's Delight](https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685) and [this list](http://graphics.stanford.edu/%7Eseander/bithacks.html#BitReverseObvious)). – Paul R Sep 20 '17 at 09:33
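The "standard bit twiddling hacks" approach mentioned in the last comment can be sketched in scalar form, as a reference to check any vectorized version against. `rbit32_scalar` is a hypothetical name; the masks are the usual swap-halves constants:

```cpp
#include <cstdint>

// Scalar reference: reverse the 32 bits of x by swapping progressively
// smaller groups (16-bit halves, bytes, nibbles, bit pairs, single bits).
static uint32_t rbit32_scalar(uint32_t x) {
    x = (x >> 16) | (x << 16);                                 // swap 16-bit halves
    x = ((x & 0xFF00FF00u) >> 8) | ((x & 0x00FF00FFu) << 8);   // swap bytes
    x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);   // swap nibbles
    x = ((x & 0xCCCCCCCCu) >> 2) | ((x & 0x33333333u) << 2);   // swap bit pairs
    x = ((x & 0xAAAAAAAAu) >> 1) | ((x & 0x55555555u) << 1);   // swap adjacent bits
    return x;
}
```

On the question's example, `rbit32_scalar(2732370386u)` gives `1268071237u`, matching the expected output.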

1 Answer

Since I can't find a suitable dupe, I'll just post it.

The main idea here is to make use of pshufb's dual use as a parallel 16-entry table lookup to reverse the bits of each nibble. Reversing the byte order is obvious. Reversing the order of the two nibbles in every byte could be done by building it into the lookup tables (saves a shift) or by explicitly shifting the low nibble up (saves a LUT).

Something like this in total, not tested:

__m256i rbit32(__m256i x) {
    // reverses the byte order within each 32-bit lane (pshufb works per 128-bit lane)
    __m256i shufbytes = _mm256_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    // luthigh[i] = bit-reverse of the 4-bit value i
    __m256i luthigh = _mm256_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
    // same table pre-shifted into the high nibble (entries fit in 4 bits, so the 16-bit shift can't carry between bytes)
    __m256i lutlow = _mm256_slli_epi16(luthigh, 4);
    __m256i lowmask = _mm256_set1_epi8(15);
    __m256i rbytes = _mm256_shuffle_epi8(x, shufbytes);
    // low nibble of each byte -> reversed and placed in the high nibble
    __m256i high = _mm256_shuffle_epi8(lutlow, _mm256_and_si256(rbytes, lowmask));
    // high nibble of each byte -> reversed and placed in the low nibble
    __m256i low = _mm256_shuffle_epi8(luthigh, _mm256_and_si256(_mm256_srli_epi16(rbytes, 4), lowmask));
    return _mm256_or_si256(low, high);
}

In a typical loop context, those constant loads should be hoisted out of the loop.

Curiously, Clang uses 4 shuffles; it duplicates the first shuffle.

harold
  • @LưuVĩnhPhúc The solution in your link flips bytes instead of 32-bit ints. – wim Sep 20 '17 at 10:02
  • Gcc7.2 is pretty braindead here, too. It turns `lutlow` into a separate memory constant, but it shifts it after loading. (Without having used it for anything else). I think it just created a copy of `luthigh` but using `.value` assembler directives (16-bit chunks) instead of `.byte`. – Peter Cordes Sep 20 '17 at 11:32