I assume several things:
- You deal with int32 data (it can easily be changed to float32, though).
- You can pass 4 values to your function at once (not just one). That's what people usually mean by vectorization.
- Constants are sorted, i.e. 0 < NEAR < T1 < T2 < T3.
Here is a vectorized function:
__m128i func4(__m128i D) {
__m128i cmp_m3 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T3));
__m128i cmp_m2 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T2));
__m128i cmp_m1 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T1));
__m128i cmp_p0 = _mm_cmpgt_epi32(D, _mm_set1_epi32(NEAR));
__m128i reduce_true = _mm_add_epi32(_mm_add_epi32(cmp_m3, cmp_m2), _mm_add_epi32(cmp_m1, cmp_p0));
__m128i cmp_m0 = _mm_cmplt_epi32(D, _mm_set1_epi32(-NEAR));
__m128i cmp_p1 = _mm_cmplt_epi32(D, _mm_set1_epi32(T1));
__m128i cmp_p2 = _mm_cmplt_epi32(D, _mm_set1_epi32(T2));
__m128i cmp_p3 = _mm_cmplt_epi32(D, _mm_set1_epi32(T3));
__m128i reduce_false = _mm_add_epi32(_mm_add_epi32(cmp_p3, cmp_p2), _mm_add_epi32(cmp_p1, cmp_m0));
return _mm_sub_epi32(reduce_false, reduce_true);
}
If the input data is random, then it works 11 times faster than your original version on Ivy Bridge with MSVC2013 x64:
Time = 4.436 (-39910000)
Time = 0.409 (-39910000)
The full code with testing is available here.
The idea is rather simple.
You can see non-vectorized version of the proposed solution in the function funcX
following the link above. It may explain everything better than words.
We take a register D
as input, it contains 4 packed values.
Then we compare it against all the 8 constants you have with _mm_cmp*
intrinsic. This comparison produces 8 bitmasks cmp_pX
, cmp_mX
. In a bitmask all the bits corresponding to a number are either zero or one. 32 zero bits are set for each comparison, which was false. If the condition of comparison was true, then the 32 bits are set to 1.
Now recall that 32-bit integer with all one bits is -1 in signed representation. When we add four results of comparison together, we obtain a pack of counts negated. Finally, we take the difference of two counts, and it is the desired result.
P.S. Here is the assembly code generated for inner loop:
movdqa xmm3, XMMWORD PTR [rcx]
movdqa xmm4, xmm10
movdqa xmm0, xmm9
add rcx, 16
pcmpgtd xmm0, xmm3
pcmpgtd xmm4, xmm3
paddd xmm4, xmm0
movdqa xmm2, xmm3
movdqa xmm1, xmm8
pcmpgtd xmm1, xmm3
pcmpgtd xmm2, xmm14
movdqa xmm0, xmm7
pcmpgtd xmm0, xmm3
paddd xmm1, xmm0
paddd xmm4, xmm1
movdqa xmm0, xmm3
movdqa xmm1, xmm3
pcmpgtd xmm1, xmm12
pcmpgtd xmm0, xmm13
pcmpgtd xmm3, xmm11
paddd xmm1, xmm3
paddd xmm2, xmm0
paddd xmm2, xmm1
psubd xmm4, xmm2
paddd xmm4, xmm5
movdqa xmm5, xmm4
cmp rcx, r15
jl SHORT $LL3@main