I have the following multiplication between 2 16 bit vectors:
int16x8_t dx;
int16x8_t dy;
int16x8_t dxdy = vmulq_s16(dx, dy);
In case dx
and dy
are both large enough, the result will overflow.
I would like to clamp the resulting multiplication between the values of MIN_INT16 and MAX_INT16;
I have not found a way to do that without first converting the values to int32. This is what I do now:
int32x4_t dx_low4 = vmovl_s16(simde_vget_low_s16(dx)); // get lower 4 elements and widen
int32x4_t dx_high4 = vmovl_high_s16(dx); // widen higher 4 elements
int32x4_t dy_low4 = vmovl_s16(simde_vget_low_s16(dy)); // get lower 4 elements and widen
int32x4_t dy_high4 = vmovl_high_s16(dy); // widen higher 4 elements
int32x4_t dxdy_low = vmulq_s32(dx_low4, dy_low4);
int32x4_t dxdy_high = vmulq_s32(dx_high4, dy_high4);
// combine and handle saturation:
int16x8_t dxdy = vcombine_s16(vqmovn_s32(dxdy_low), vqmovn_s32(dxdy_high));
Is there a way to achieve this more efficiently?