How to optimization long series of If/then conditional expressions - SIMD

Question

I'm using SIMD for improving the performance of C code, but I encountered a function with many if/then condition as below:

if (Di <= -T3) return  -4;
if (Di <= -T2) return  -3;
if (Di <= -T1) return  -2;
if (Di < -NEAR)  return  -1;
if (Di <=  NEAR) return   0;
if (Di < T1)   return   1;
if (Di < T2)   return   2;
if (Di < T3)   return   3;

return  4;

Using Intel intrinsic functions which were supported in VC++ compiler give out a slower processing time.

Thus are there any better way to optimize this long series of conditional expression ?

@black I think this would actually be one of the few questions that could be applicable to both languages... (As the answers will likely be valid for both.) — anderas, Jul 31 '15 at 10:04
He actually refers to both, "C code" and the "VC++ compiler". @KhaiNguyen is your question specific to C or C++, or relevant to both? — anderas, Jul 31 '15 at 10:08
What intel intrinsics have you used? Could you show the code for that experiment? — anderas, Jul 31 '15 at 10:11
If the order of `T[n]`s is fixed, a binary search is probably faster than going through the conditions linearly. — EOF, Jul 31 '15 at 10:24
@EOF yes, the order is fixed, I have not thought about binary search at first — Khai Nguyen, Jul 31 '15 at 11:50
@black Im using MSVC2010 compiler. IMHO, this is relevant for both C or C++ — Khai Nguyen, Jul 31 '15 at 11:52
If you have a lot of memory to cover [-T3,T3], then LUT will be the easiest. — user3528438, Jul 31 '15 at 12:52
http://stackoverflow.com/questions/6754454/speed-difference-between-if-else-and-ternary-operator-in-c/31198658#31198658 — kchoi, Jul 31 '15 at 16:00

stgatilov · Accepted Answer · 2015-07-31T11:44:21.083

I assume several things:

You deal with int32 data (it can easily be changed to float32, though).
You can pass 4 values to your function at once (not just one). That's what people usually mean by vectorization.
Constants are sorted, i.e. 0 < NEAR < T1 < T2 < T3.

Here is a vectorized function:

__m128i func4(__m128i D) {
  __m128i cmp_m3 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T3));
  __m128i cmp_m2 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T2));
  __m128i cmp_m1 = _mm_cmpgt_epi32(D, _mm_set1_epi32(-T1));
  __m128i cmp_p0 = _mm_cmpgt_epi32(D, _mm_set1_epi32(NEAR));
  __m128i reduce_true = _mm_add_epi32(_mm_add_epi32(cmp_m3, cmp_m2), _mm_add_epi32(cmp_m1, cmp_p0));
  __m128i cmp_m0 = _mm_cmplt_epi32(D, _mm_set1_epi32(-NEAR));
  __m128i cmp_p1 = _mm_cmplt_epi32(D, _mm_set1_epi32(T1));
  __m128i cmp_p2 = _mm_cmplt_epi32(D, _mm_set1_epi32(T2));
  __m128i cmp_p3 = _mm_cmplt_epi32(D, _mm_set1_epi32(T3));
  __m128i reduce_false = _mm_add_epi32(_mm_add_epi32(cmp_p3, cmp_p2), _mm_add_epi32(cmp_p1, cmp_m0));
  return _mm_sub_epi32(reduce_false, reduce_true);
}

If the input data is random, then it works 11 times faster than your original version on Ivy Bridge with MSVC2013 x64:

Time = 4.436   (-39910000)
Time = 0.409   (-39910000)

The full code with testing is available here.

The idea is rather simple. You can see non-vectorized version of the proposed solution in the function funcX following the link above. It may explain everything better than words.

We take a register D as input, it contains 4 packed values. Then we compare it against all the 8 constants you have with _mm_cmp* intrinsic. This comparison produces 8 bitmasks cmp_pX, cmp_mX. In a bitmask all the bits corresponding to a number are either zero or one. 32 zero bits are set for each comparison, which was false. If the condition of comparison was true, then the 32 bits are set to 1.

Now recall that 32-bit integer with all one bits is -1 in signed representation. When we add four results of comparison together, we obtain a pack of counts negated. Finally, we take the difference of two counts, and it is the desired result.

P.S. Here is the assembly code generated for inner loop:

movdqa  xmm3, XMMWORD PTR [rcx]
movdqa  xmm4, xmm10
movdqa  xmm0, xmm9
add rcx, 16
pcmpgtd xmm0, xmm3
pcmpgtd xmm4, xmm3
paddd   xmm4, xmm0
movdqa  xmm2, xmm3
movdqa  xmm1, xmm8
pcmpgtd xmm1, xmm3
pcmpgtd xmm2, xmm14
movdqa  xmm0, xmm7
pcmpgtd xmm0, xmm3
paddd   xmm1, xmm0
paddd   xmm4, xmm1
movdqa  xmm0, xmm3
movdqa  xmm1, xmm3
pcmpgtd xmm1, xmm12
pcmpgtd xmm0, xmm13
pcmpgtd xmm3, xmm11
paddd   xmm1, xmm3
paddd   xmm2, xmm0
paddd   xmm2, xmm1
psubd   xmm4, xmm2
paddd   xmm4, xmm5
movdqa  xmm5, xmm4
cmp rcx, r15
jl  SHORT $LL3@main

Yes, your assumptions are correct. Your solution works 2x time faster in my machine. Thank you very much. — Khai Nguyen, Jul 31 '15 at 12:01
If you target a CPU with horizontal add or fast shuffles, it may quite possibly be faster to simply do one compare for every 4 words, right-shift the result by 31, multiply by the output values, and horizontal add them. But still, very nice solution. — Damon, Jul 31 '15 at 12:12
@Damon: You proposal may work for single-input function, but I think there is no sense in horizontal operations when we deal with packed input. — stgatilov, Jul 31 '15 at 12:18

score 0 · Answer 2 · answered Jul 31 '15 at 10:27

You can try to get rid of conditions at all and measure the time again. Your code

if (Di <= -T3) return  -4;
if (Di <= -T2) return  -3;
if (Di <= -T1) return  -2;
if (Di < -NEAR)  return  -1;
if (Di <=  NEAR) return   0;
if (Di < T1)   return   1;
if (Di < T2)   return   2;
if (Di < T3)   return   3;

return  4;

can be transformed to unconditional form:

return
    (Di <= -T3)*(-4) + (Di > -T3) * (
    (Di <= -T2)*(-3) + (Di > -T2) * (
    (Di <= -T1)*(-2) + (Di > -T1) * (
    (Di < -NEAR)*(-1) + (Di >= -NEAR) * (
    (Di <=  NEAR)*0 + (Di > NEAR) * (
    (Di < T1)*1 + (Di >= T1) * (
    (Di < T2)*2 + (Di >= T2) * (
    (Di < T3)*3 + (Di >= T3) * (
    4
    ))))))));

Probably, you can further optimize this code, having some knowledge about possible content of your variables.

How to optimization long series of If/then conditional expressions - SIMD

2 Answers2