
Say there are a lot of uint32s stored in aligned memory, uint32 *p. How can I convert them to uint8s with SIMD?

I see there is _mm256_cvtepi32_epi8/vpmovdb, but it belongs to AVX-512, and my CPU doesn't support it.

  • How exactly do you want to convert them? With saturation or truncation? What is the range of the 32-bit values? – Andrey Semashev Sep 07 '20 at 09:39
  • truncate them to 255 – Wiki Wang Sep 07 '20 at 10:33
  • You might be best starting with `vpshufb`. All the `vpack...` instructions treat their *input* as signed, even if they do unsigned saturation of the output (like `vpackusdw`), so `0xFFFFFFFF` would signed-saturate to `0` (-1 to 0) rather than to `0xFFFF` (UINT_MAX -> USHORT_MAX); see the small demo after these comments. – Peter Cordes Sep 07 '20 at 10:36
  • > truncate them to 255 -- This does not clarify things. What should be the result of converting the value 256? – Andrey Semashev Sep 07 '20 at 10:54
  • 1
    will I mean just pick the lowest 8 bits, 0x87654321 shall be 0x21 – Wiki Wang Sep 07 '20 at 11:02
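
Below is a minimal demo of the saturation pitfall from Peter Cordes's comment (an untested sketch; the `main` harness is mine):

#include <immintrin.h>
#include <cstdio>

int main()
{
    // Every 32-bit element holds 0xFFFFFFFF, which is -1 when read as signed
    const __m256i v = _mm256_set1_epi32( -1 );
    // vpackusdw treats its *input* as signed int32, saturating to unsigned uint16
    const __m256i packed = _mm256_packus_epi32( v, v );
    // -1 clamps to 0, not to 0xFFFF, so packing alone can't implement truncation
    printf( "%u\n", (unsigned)( _mm256_extract_epi16( packed, 0 ) & 0xFFFF ) );  // prints 0
}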

1 Answer


If you really have a lot of them, I would do something like this (untested).

The main loop reads 64 bytes per iteration containing 16 uint32_t values, shuffles bytes around to implement the truncation, merges the results into a single register, and writes 16 bytes with a vector store instruction.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )
{
    // 4 bytes of the shuffle mask to fetch bytes 0, 4, 8 and 12 from a 16-byte lane
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle first 8 values of the batch, making first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle last 8 values of the batch, making last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers; _mm256_load_si256 requires
        // 32-byte alignment, use _mm256_loadu_si256 for unaligned sources
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into correct positions; this zeroes out the rest of the bytes.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // Unused bytes were zeroed out, using bitwise OR to merge, very fast.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final shuffle of the 32-bit values into correct positions
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }

    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
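
A quick usage sketch (the buffer size and fill values are mine, for illustration); the source needs 32-byte alignment for the `_mm256_load_si256` loads:

int main()
{
    alignas( 32 ) uint32_t src[ 100 ];
    uint8_t dst[ 100 ];
    for( size_t i = 0; i < 100; i++ )
        src[ i ] = 0x87654300u + (uint32_t)i;
    // 100 = 6 * 16 + 4, exercising both the vectorized loop and the remainder
    convertToBytes( src, dst, 100 );
    // Each output byte is the lowest byte of the corresponding input:
    // dst[ 0 ] == 0x00, dst[ 1 ] == 0x01, ...
    return 0;
}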
Soonts
  • If you arrange your epi8 shuffles correctly, you should be able to do the final `res16` 32->16 byte shuffle with one `vpermd` (or maybe even `vpermq`), rather than `vextracti128` + `vpor`. Unless you're tuning for Zen1 (where lane-extract is very cheap), just 1 shuffle is better than shuffle+or. – Peter Cordes Sep 07 '20 at 13:39
  • Hmm, another alternative would be differently-aligned loads to feed a byte-blend + `vpshufb` + `vpermd`. IDK if that's any better, although Skylake runs `vpblendvb` as 2 uops for any ALU port. With a 64-byte aligned source, you can arrange it so none of the loads are cache-line splits. – Peter Cordes Sep 07 '20 at 14:11
  • @PeterCordes I wouldn’t mess with loads. The only reason sequential RAM loads are fast is the prefetcher in CPUs; dense, aligned, sequential access is the best case for that piece of hardware. Once you start introducing offsets, you’re at the mercy of the implementation, which may or may not do a good job performance-wise. – Soonts Sep 07 '20 at 14:51
  • Interesting point, that might possibly throw off L1d prefetching. But the main prefetchers are in L2, and they only see the stream of requests from L1 for full cache lines. But I'd guess even L1d prefetch would probably still be fine; you have an unrolled loop where each load sees an offset of 64 bytes since the last iteration; the fact that the loads are offset from each other by 31 bytes is not AFAIK significant. I think there was another Q&A where someone implemented a similar alternating pair of slightly overlapping loads + blends for a similar problem, with good results. – Peter Cordes Sep 07 '20 at 16:01
  • [How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)](https://stackoverflow.com/q/51778721) uses 4 shuffles per 4 input vectors to make a `__m256i`, vs. this using 3 shuffles per 2 input vectors. The 2x `vpackssdw` + `vpackuswb` + `vpermd` strategy seems better than this, if you have lots of data. – Peter Cordes May 11 '23 at 18:12
  • 1
    @PeterCordes Yeah, but that strategy would need 4 extra bitwise instructions to zero the higher 3 bytes in each integer, to work around the saturation of these packing instructions. And on latest-gen Intel CPUs, `vpshufb` is faster than `vpackssdw`, 2x throughput, 1/3 latency. I don’t think it’s a clear win, but on new AMD CPUs packing probably faster: packing same speed as `vpshufb`, and bitwise ops throughput is higher, 3-4 instructions/clock. – Soonts May 11 '23 at 18:31
  • Oh right, in the FP conversion question, the floats were supposed to be in a 0..255 value range. This one doesn't specify, so some use-cases probably need truncation. If you actually *wanted* saturation, `vpackssdw` / `vpackuswb` is nice. Good point about shuffle throughputs on Ice Lake and later; ironically I was just commenting about `vshufpd` vs. `vpermilpd imm8` on another of your answers. – Peter Cordes May 11 '23 at 18:39
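
For reference, here is a sketch of the pack-based strategy discussed in the last few comments (untested; the function name is mine, and it needs the same includes as the answer above). Four `vpand`s zero the upper 3 bytes of each value so the signed saturation of `vpackssdw` can never kick in, then two pack steps narrow 32 -> 16 -> 8 bits, and one `vpermd` undoes the in-lane interleaving. Each iteration consumes 4 input vectors (32 values) and does one full 32-byte store:

void convertToBytesPack( const uint32_t* source, uint8_t* dest, size_t count )
{
    // Truncate each uint32 to its low byte before packing
    const __m256i byteMask = _mm256_set1_epi32( 0xFF );
    // vpackssdw / vpackuswb interleave their operands within each 128-bit lane;
    // this vpermd control puts the 8 result dwords back into source order
    const __m256i fixup = _mm256_setr_epi32( 0, 4, 1, 5, 2, 6, 3, 7 );

    const size_t countRounded = count & ~(size_t)31;
    const uint32_t* const vecEnd = source + countRounded;
    while( source < vecEnd )
    {
        const __m256i a = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )source ), byteMask );
        const __m256i b = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + 8 ) ), byteMask );
        const __m256i c = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + 16 ) ), byteMask );
        const __m256i d = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + 24 ) ), byteMask );
        source += 32;
        // 32 -> 16 bits; inputs are 0..255, so the signed saturation is a no-op
        const __m256i ab = _mm256_packs_epi32( a, b );       // vpackssdw
        const __m256i cd = _mm256_packs_epi32( c, d );
        // 16 -> 8 bits, then fix up the dword order and store all 32 bytes
        const __m256i abcd = _mm256_packus_epi16( ab, cd );  // vpackuswb
        _mm256_storeu_si256( ( __m256i* )dest, _mm256_permutevar8x32_epi32( abcd, fixup ) );
        dest += 32;
    }
    // Scalar remainder
    for( const uint32_t* const end = source + ( count - countRounded ); source < end; source++, dest++ )
        *dest = (uint8_t)*source;
}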