
I have been trying to figure out the best way to use AMD64 SIMD instructions to implement a lerp over large sets of u8 values, but I can't seem to find the correct instructions without requiring all of the newer SIMD extensions.

The formula I am working with right now is

u8* a;
u8* b;
u8* result;
size_t count;
u16 total;
u16 progress;

u32 invertedProgress = total - progress;
for(size_t i = 0; i < count; i++){
    result[i] = (u8)((b[i] * progress + a[i] * invertedProgress) / total);
}

I am thinking it would look something like:

u8* a;
u8* b;
u8* result;
size_t count;
u16 total;
u16 progress;

__m128i mmxZero;
__m128i mmxProgress;
__m128i mmxTotal;
__m128i mmxInvertedProgress;
__m128i mmxProductA;
__m128i mmxProductB;

mmxZero = _mm_xor_ps(zero, zero); // Is there a clear?

mmxProgress = Fill with progress;

mmxTotal = Fill with total;

mmxInvertedProgress = mmxTotal;
mmxInvertedProgress = _mm_unpacklo_epi8(mmxInvertedProgress, mmxZero);
mmxInvertedProgress = _mm_sub_epi16(mmxTotal, mmxProgress);

for(size_t i = 0; i < count; i += 8){
    mmxProductA = load A;
    // u8 -> u16
    mmxProductA = _mm_unpacklo_epi8(mmxProductA, mmxZero);
    
    mmxProductB = load B;
    // u8 -> u16
    mmxProductB = _mm_unpacklo_epi8(mmxProductB, mmxZero);

    // a * (total - progress)
    mmxProductA = _mm_mullo_epi16(mmxProductA, mmxInvertedProgress);
    // b * progress
    mmxProductB = _mm_mullo_epi16(mmxProductB, mmxProgress);

    // a * (total - progress) + b * progress
    mmxProductA = _mm_add_epi16(mmxProductA, mmxProductB);
    // (a * (total - progress) + b * progress) / total
    mmxProductA = _mm_div_epi16(mmxProductA, mmxTotal);

    mmxProductA = saturated u16 -> u8;
    store result = mmxProductA;
}

There are a couple of things here that I just could not seem to find digging around in the Intel intrinsics guide, mostly related to loading and storing values.

I know there are some newer instructions that can process larger amounts of data at the same time, but this initial implementation is supposed to work on older chips.

For this example I am also ignoring alignment and the potential for a buffer overrun, I figured that was a little out of scope for the question.

Peter Cordes
gudenau
    `_mm_div_epi16` is a library function, not an intrinsic for a hardware instruction. Even AVX512 doesn't have SIMD integer division, only floating-point division. Intel's intrinsics guide confusingly lists their SVML library functions along with intrinsics for real instructions, perhaps to tempt people into buying their libraries. You can filter that out with the checkboxes on the left side. – Peter Cordes Oct 02 '20 at 03:13
    Also, MMX pre-dates SSE / SSE2, and has only 64-bit integer vectors. I'd suggest variable names like `__m128i vzero = _mm_setzero_si128();` or `_mm_set1_epi32(0)`. Don't try to xor-zero in your C source, let compilers do that optimization for you. – Peter Cordes Oct 02 '20 at 03:18
  • I honestly didn't find the set intrinsic, I just did that because that's the only way I found to zero. There are so many intrinsics on that page! – gudenau Oct 02 '20 at 17:30
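
The setup intrinsics mentioned in the comments can be sketched like this (a hypothetical `splat_setup` helper, not from the question itself; SSE2 only, no MMX needed):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

// Broadcast-setup sketch for the question's pseudocode, using the
// intrinsics suggested in the comments. Returns total - progress
// broadcast into all eight 16-bit lanes.
static inline __m128i splat_setup( uint16_t progress, uint16_t total, __m128i* zero )
{
    *zero = _mm_setzero_si128();                          // the "clear" asked about
    __m128i vProgress = _mm_set1_epi16( (short)progress ); // "Fill with progress"
    __m128i vTotal = _mm_set1_epi16( (short)total );       // "Fill with total"
    return _mm_sub_epi16( vTotal, vProgress );             // total - progress
}
```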

1 Answer


Good question. As you found out, SSE has no integer divide instruction, and (unlike ARM NEON) it doesn’t have multiplication or FMA for bytes.

Here’s what I usually do instead. The code below splits vectors into even/odd bytes, uses 16-bit multiplication instructions to scale separately, then merges them back into bytes.

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <array>
#include <emmintrin.h> // SSE2

// Linear interpolation is based on the formula x*(1-s) + y*s,
// which can equivalently be written as x + s*(y - x).
class LerpBytes
{
    // Multipliers are fixed point numbers in 16-bit lanes of these vectors, in 1.8 format
    __m128i mulX, mulY;

public:

    LerpBytes( uint16_t progress, uint16_t total )
    {
        // The source and result are bytes.
        // Multipliers only need 1.8 fixed point format, anything above that is wasteful.
        assert( total > 0 );
        assert( progress >= 0 );
        assert( progress <= total );

        const uint32_t fp = (uint32_t)progress * 0x100 / total;
        mulY = _mm_set1_epi16( (short)fp );
        mulX = _mm_set1_epi16( (short)( 0x100 - fp ) );
    }

    __m128i lerp( __m128i x, __m128i y ) const
    {
        const __m128i lowMask = _mm_set1_epi16( 0xFF );

        // Split both vectors into even/odd bytes in 16-bit lanes
        __m128i lowX = _mm_and_si128( x, lowMask );
        __m128i highX = _mm_srli_epi16( x, 8 );
        __m128i lowY = _mm_and_si128( y, lowMask );
        __m128i highY = _mm_srli_epi16( y, 8 );

        // That multiply instruction has relatively high latency, 3-5 cycles.
        // We're lucky to have 4 vectors to handle.
        lowX = _mm_mullo_epi16( lowX, mulX );
        lowY = _mm_mullo_epi16( lowY, mulY );
        highX = _mm_mullo_epi16( highX, mulX );
        highY = _mm_mullo_epi16( highY, mulY );

        // Add the products
        __m128i low = _mm_adds_epu16( lowX, lowY );
        __m128i high = _mm_adds_epu16( highX, highY );

        // Pack them back into bytes.
        // The multiplier was 1.8 fixed point, trimming the lowest byte off both vectors.
        low = _mm_srli_epi16( low, 8 );
        high = _mm_andnot_si128( lowMask, high );
        return _mm_or_si128( low, high );
    }
};

static void print( const char* what, __m128i v )
{
    printf( "%s:\t", what );
    alignas( 16 ) std::array<uint8_t, 16> arr;
    _mm_store_si128( ( __m128i * )arr.data(), v );
    for( uint8_t b : arr )
        printf( " %02X", (int)b );
    printf( "\n" );
}

int main()
{
    const __m128i x = _mm_setr_epi32( 0x33221100, 0x77665544, 0xBBAA9988, 0xFFEEDDCC );
    const __m128i y = _mm_setr_epi32( 0xCCDDEEFF, 0x8899AABB, 0x44556677, 0x00112233 );
    LerpBytes test( 0, 1 );
    print( "zero", test.lerp( x, y ) );
    test = LerpBytes( 1, 1 );
    print( "one", test.lerp( x, y ) );
    test = LerpBytes( 1, 2 );
    print( "half", test.lerp( x, y ) );
    test = LerpBytes( 1, 3 );
    print( "1/3", test.lerp( x, y ) );
    test = LerpBytes( 1, 4 );
    print( "1/4", test.lerp( x, y ) );
    return 0;
}
Soonts
  • This looks very promising, I'll try this out when I get a chance and report back. – gudenau Oct 02 '20 at 17:29
  • Does there happen to be a way to load the values into a register from a pointer? – gudenau Oct 03 '20 at 01:43
  • I also must be missing something stupid because I am not seeing a way to reverse _mm_setr_epi32. – gudenau Oct 03 '20 at 02:23
  • @gudenau See `_mm_load_si128` and `_mm_loadu_si128`; there are more of them, but these two are the most popular ones for 16-byte integer vectors. – Soonts Oct 03 '20 at 09:11
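
Putting the comment-thread pointers together, a buffer-to-buffer driver might look like the sketch below. This is only an illustration, not part of the answer: the hypothetical `lerp_buffer` helper inlines the same even/odd-byte, 1.8 fixed-point math as `LerpBytes`, uses `_mm_loadu_si128`/`_mm_storeu_si128` for the loads and stores asked about, and handles a non-multiple-of-16 tail with matching scalar code.

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>
#include <cstddef>

// Hypothetical helper: lerp two byte buffers with the same 1.8 fixed-point
// math as the answer's LerpBytes. Unaligned loads/stores; scalar tail.
inline void lerp_buffer( const uint8_t* a, const uint8_t* b, uint8_t* result,
                         size_t count, uint16_t progress, uint16_t total )
{
    const uint32_t fp = (uint32_t)progress * 0x100 / total;
    const __m128i mulX = _mm_set1_epi16( (short)( 0x100 - fp ) );
    const __m128i mulY = _mm_set1_epi16( (short)fp );
    const __m128i lowMask = _mm_set1_epi16( 0xFF );

    size_t i = 0;
    for( ; i + 16 <= count; i += 16 )
    {
        // Unaligned loads; _mm_load_si128 works too if the buffers are 16-byte aligned.
        const __m128i x = _mm_loadu_si128( (const __m128i*)( a + i ) );
        const __m128i y = _mm_loadu_si128( (const __m128i*)( b + i ) );

        // Same even/odd split and 16-bit multiplies as in the answer's lerp().
        __m128i low = _mm_adds_epu16(
            _mm_mullo_epi16( _mm_and_si128( x, lowMask ), mulX ),
            _mm_mullo_epi16( _mm_and_si128( y, lowMask ), mulY ) );
        __m128i high = _mm_adds_epu16(
            _mm_mullo_epi16( _mm_srli_epi16( x, 8 ), mulX ),
            _mm_mullo_epi16( _mm_srli_epi16( y, 8 ), mulY ) );

        low = _mm_srli_epi16( low, 8 );
        high = _mm_andnot_si128( lowMask, high );
        _mm_storeu_si128( (__m128i*)( result + i ), _mm_or_si128( low, high ) );
    }
    // Scalar tail for counts that aren't a multiple of 16, same fixed-point math.
    for( ; i < count; i++ )
        result[i] = (uint8_t)( ( a[i] * ( 0x100 - fp ) + b[i] * fp ) >> 8 );
}
```

Note the fixed-point rounding matches `LerpBytes` exactly (truncation after the `>> 8`), so the vector body and the scalar tail produce identical bytes.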