What is the most efficient way to handle integer multiplication overflow with saturation with ARM Neon intrinsics?

Question

I have the following multiplication between two vectors of 16-bit integers:

int16x8_t dx;
int16x8_t dy;
int16x8_t dxdy = vmulq_s16(dx, dy);

If the corresponding lanes of dx and dy are both large enough, the product overflows 16 bits.

I would like to clamp each product to the range [INT16_MIN, INT16_MAX].
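To make the desired behaviour concrete, here is a minimal per-lane sketch in plain C. The helper names wrap_mul_s16 and sat_mul_s16 are hypothetical and used only for illustration: the first models what vmulq_s16 does to each lane (it keeps the low 16 bits of the product), the second is the clamped result I am after.

```c
#include <stdint.h>

/* Hypothetical helper: models one lane of vmulq_s16, which keeps only
   the low 16 bits of the product (two's complement wrap-around). */
static int16_t wrap_mul_s16(int16_t a, int16_t b)
{
    /* Multiply as unsigned to avoid signed overflow, then keep 16 bits */
    const uint16_t lo = (uint16_t)((uint32_t)a * (uint32_t)b);
    /* Reinterpret the 16-bit pattern as a signed value */
    return (lo >= 0x8000u) ? (int16_t)((int32_t)lo - 65536) : (int16_t)lo;
}

/* Hypothetical helper: the wanted result for one lane.
   Widen to 32 bits, multiply, then clamp to the int16_t range. */
static int16_t sat_mul_s16(int16_t a, int16_t b)
{
    const int32_t p = (int32_t)a * (int32_t)b; /* always fits in 32 bits */
    if (p > INT16_MAX) return INT16_MAX;
    if (p < INT16_MIN) return INT16_MIN;
    return (int16_t)p;
}
```

For example, 300 * 300 = 90000 does not fit in 16 bits: the wrapping version yields 24464, while the saturating version yields 32767.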

I have not found a way to do that without first converting the values to int32. This is what I do now:

int32x4_t dx_low4 = vmovl_s16(vget_low_s16(dx)); // get lower 4 elements and widen
int32x4_t dx_high4 = vmovl_high_s16(dx); // widen higher 4 elements
int32x4_t dy_low4 = vmovl_s16(vget_low_s16(dy)); // get lower 4 elements and widen
int32x4_t dy_high4 = vmovl_high_s16(dy); // widen higher 4 elements

int32x4_t dxdy_low = vmulq_s32(dx_low4, dy_low4);
int32x4_t dxdy_high = vmulq_s32(dx_high4, dy_high4);
// narrow each half with saturation, then combine:
int16x8_t dxdy = vcombine_s16(vqmovn_s32(dxdy_low), vqmovn_s32(dxdy_high));

Is there a way to achieve this more efficiently?

score 4 · Accepted Answer · answered Jan 06 '22 at 20:03

Here’s another version. It does essentially the same thing as your code but with fewer instructions, because NEON has a widening multiply that replaces the separate widen and multiply steps. I’m not sure whether it’s faster or slower (I could not find searchable NEON instruction timings anywhere online). Untested, but the code looks right and should compile to 4 instructions.

inline int16x8_t saturatingMultiply( int16x8_t dx, int16x8_t dy )
{
    // Multiply + widen lower 4 lanes; vget_low_s16 is free, compiles into no instructions
    const int32x4_t low32 = vmull_s16( vget_low_s16( dx ), vget_low_s16( dy ) );
    // Multiply + widen higher 4 lanes
    const int32x4_t high32 = vmull_high_s16( dx, dy );
    // Saturate + narrow lower 4 lanes
    const int16x4_t low16 = vqmovn_s32( low32 );
    // Saturate + narrow remaining 4 lanes, moving into the higher lanes of the result
    return vqmovn_high_s32( low16, high32 );
}

If you’re compiling this for ARMv7 instead of ARM64, you will need a few changes: vmull_high_s16 and vqmovn_high_s32 are not available there, so use vget_high_s16 and vcombine_s16 as workarounds for the missing intrinsics.
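A minimal sketch of what that ARMv7 variant might look like, untested on real hardware. The names saturating_multiply_v7 and saturating_multiply_8 are hypothetical, and the scalar fallback is an assumption added only so the snippet also builds on non-ARM targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* ARMv7-compatible sketch: avoids the AArch64-only *_high_* intrinsics. */
static inline int16x8_t saturating_multiply_v7(int16x8_t dx, int16x8_t dy)
{
    /* Multiply + widen the lower and the higher 4 lanes */
    const int32x4_t low32  = vmull_s16(vget_low_s16(dx),  vget_low_s16(dy));
    const int32x4_t high32 = vmull_s16(vget_high_s16(dx), vget_high_s16(dy));
    /* Saturate + narrow each half, then join the halves into one vector */
    return vcombine_s16(vqmovn_s32(low32), vqmovn_s32(high32));
}
#endif

/* Hypothetical wrapper: saturating multiply of 8 lanes read from memory. */
static void saturating_multiply_8(const int16_t *dx, const int16_t *dy,
                                  int16_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s16(out, saturating_multiply_v7(vld1q_s16(dx), vld1q_s16(dy)));
#else
    /* Scalar fallback (assumption: only here for non-NEON builds) */
    for (int i = 0; i < 8; ++i) {
        int32_t p = (int32_t)dx[i] * (int32_t)dy[i];
        if (p > INT16_MAX) p = INT16_MAX;
        if (p < INT16_MIN) p = INT16_MIN;
        out[i] = (int16_t)p;
    }
#endif
}
```

Whether the compiler emits exactly the same number of instructions as the ARM64 version depends on the target and optimization level, so it is worth checking the generated assembly.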

Very Nice! I can confirm that your method improved my code by about 50 microseconds (about 1ms total time). Not a lot but definitely noticeable. — Elad Maimoni, Jan 06 '22 at 20:29
