Strange behaviour of _mm256_shuffle_epi8

Question

I have the following code:

    auto source = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    auto shuffle = _mm256_set_epi8(31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 15);
    auto resultOfShuffle = _mm256_shuffle_epi8(source, shuffle);

The result is

{31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15}

How is this possible? What's so special about the number 16?

My processor is an 8750H. I'm using Visual Studio 16.5.2.

This is the disassembly:

auto source = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
00007FF791AD1A64  vmovdqu     ymm0,ymmword ptr [__ymm@1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100 (07FF791ADCBC0h)]  
00007FF791AD1A6C  vmovdqu     ymmword ptr [rbp+1A0h],ymm0  
00007FF791AD1A74  vmovdqu     ymm0,ymmword ptr [rbp+1A0h]  
00007FF791AD1A7C  vmovdqu     ymmword ptr [source],ymm0  
    auto shuffle = _mm256_set_epi8(31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 15);
00007FF791AD1A81  vmovdqu     ymm0,ymmword ptr [__ymm@1f1f1f1f1f1f1f1f1f1f1f1f1f1f1f1f1010101010101010101010101010100f (07FF791ADCC00h)]  
00007FF791AD1A89  vmovdqu     ymmword ptr [rbp+1E0h],ymm0  
00007FF791AD1A91  vmovdqu     ymm0,ymmword ptr [rbp+1E0h]  
00007FF791AD1A99  vmovdqu     ymmword ptr [shuffle],ymm0  
    auto resultOfShuffle = _mm256_shuffle_epi8(source, shuffle);
00007FF791AD1A9E  vmovdqu     ymm0,ymmword ptr [source]  
00007FF791AD1AA3  vpshufb     ymm0,ymm0,ymmword ptr [shuffle]  
00007FF791AD1AA9  vmovdqu     ymmword ptr [rbp+220h],ymm0  
00007FF791AD1AB1  vmovdqu     ymm0,ymmword ptr [rbp+220h]  
00007FF791AD1AB9  vmovdqu     ymmword ptr [resultOfShuffle],ymm0

Were you expecting it to be lane-crossing? It's not, it's a separate `_mm_shuffle_epi8` in each 128-bit lane, still only using the low 4 bits of the index. https://www.felixcloutier.com/x86/pshufb. What you wanted doesn't exist until AVX512VBMI `vpermb` — Peter Cordes, Apr 03 '20 at 14:45
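To make the in-lane behaviour described in the comment above concrete, here is a minimal scalar sketch of what the 256-bit `vpshufb` computes for each byte. `shuffle_epi8_model` and `Bytes32` are hypothetical names introduced only for illustration; this is a model of the documented semantics, not how the hardware or any library implements it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Hypothetical helper (not a real intrinsic): scalar model of the 256-bit vpshufb.
// Element 0 is the lowest byte, i.e. the LAST argument of _mm256_set_epi8.
Bytes32 shuffle_epi8_model(const Bytes32& src, const Bytes32& idx) {
    Bytes32 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::size_t lane_base = i & 16;          // 0 for bytes 0..15, 16 for bytes 16..31
        const std::size_t low_nibble = idx[i] & 0x0F;  // only the low 4 index bits select a byte
        assert(lane_base + low_nibble < src.size());
        // A set high bit (0x80) in the index zeroes the output byte.
        out[i] = (idx[i] & 0x80) ? std::uint8_t{0} : src[lane_base + low_nibble];
    }
    return out;
}
```

With the inputs from the question, an index of 16 in the low lane reduces to `16 & 0x0F == 0`, so it selects byte 0 of the low lane (value 0), while 31 in the high lane reduces to 15 and selects byte 15 of the high lane (value 31). That matches the result shown in the question.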
@PeterCordes Ok, that makes sense. What is the best way to combine the first and second lanes? What I would like to do is something like _mm256_srli_si256(source,16) and then _mm256_max_epu8. Is that possible? I'm new to all this low level coding so this is all eye opening. — Marka, Apr 03 '20 at 15:17
Wait, so you're just trying to do a horizontal max? Extract the high lane with [`_mm256_extracti128_si256`](https://www.felixcloutier.com/x86/vextracti128:vextracti32x4:vextracti64x2:vextracti32x8:vextracti64x4). (Unfortunately Intel's current manuals are bloated with AVX512 stuff, making it harder to find just the AVX2 versions.) See also [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) - you want the same shuffles as for horizontal sum, but with max instead of add. — Peter Cordes, Apr 03 '20 at 15:33
Note that `_mm256_srli_si256(source,16)` is in-lane as well, like most SSE stuff that got widened to AVX2. Only some new AVX2 instructions are lane-crossing. — Peter Cordes, Apr 03 '20 at 15:34
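The horizontal max suggested in the comments could look roughly like the sketch below: extract the high lane, take the max against the low lane, then keep halving the 128-bit vector. The function name `horizontal_max_epu8` and the `TARGET_AVX2` macro are assumptions made for this example; the macro only exists so that GCC/Clang can build the function without `-mavx2`, and MSVC does not need it. It takes a pointer to 32 bytes rather than a `__m256i` purely to keep the example easy to call.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Assumption: GCC/Clang need AVX2 enabled for this function; MSVC accepts the intrinsics as-is.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

// Hypothetical helper: unsigned max of the 32 bytes starting at p (requires an AVX2 CPU).
TARGET_AVX2
std::uint8_t horizontal_max_epu8(const std::uint8_t* p) {
    assert(p != nullptr);
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    const __m128i lo = _mm256_castsi256_si128(v);       // low 128-bit lane (no instruction needed)
    const __m128i hi = _mm256_extracti128_si256(v, 1);  // high 128-bit lane (lane-crossing extract)
    __m128i m = _mm_max_epu8(lo, hi);                   // 32 candidates -> 16
    m = _mm_max_epu8(m, _mm_srli_si128(m, 8));          // 16 -> 8
    m = _mm_max_epu8(m, _mm_srli_si128(m, 4));          // 8 -> 4
    m = _mm_max_epu8(m, _mm_srli_si128(m, 2));          // 4 -> 2
    m = _mm_max_epu8(m, _mm_srli_si128(m, 1));          // 2 -> 1, result in byte 0
    return static_cast<std::uint8_t>(_mm_cvtsi128_si32(m));
}
```

Because `_mm256_srli_si256` shifts within each 128-bit lane, a byte shift of 16 would just produce zeros in both lanes, which is why the sketch uses `_mm256_extracti128_si256` to bring the high lane down instead.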
