3

I want to compare two floating point variables using intrinsics. If the comparison is true, do something else do something. I want to do this as a normal if..else condition. Is there any way using intrinsics?

//normal code
vector<float> v1, v2;
for(int i = 0; i < v1.size(); ++i)
if(v1[i]<v2[i])
{
    //do something
}
else
{
    //do something
)

How to do this using SSE2 or AVX?

  • Would depend on the intrinsics you are using. The typical way is to produce a mask and then use that mask to determine which value to use of the two possible scenarios. – Mats Petersson Jun 24 '16 at 06:40
  • Why would you do this using intrinsics? Why do you expect that the compiler won't generate optimal code using the built-in comparison operators? Did you try it? – Cody Gray - on strike Jun 24 '16 at 07:02
  • Posting some actual code would certainly help... – Mats Petersson Jun 24 '16 at 07:15
  • It would also help if you told us what your target architecture is. I'm assuming x86, since you have tagged this [visual-c++], but there are a lot of different instruction sets supported by the x86 family. Are you targeting 32-bit or 64-bit processors? Do you have to limit yourself to the x87 FPU, or can you use SSE? What about SSE2, or even newer instruction sets? – Cody Gray - on strike Jun 24 '16 at 07:46
  • I am using win32 console application. I can use SSE2 and newer instructions up to AVX. – sharath chandra Jun 24 '16 at 09:26
  • Well, again, why do you want to do this using intrinsics? Are you hoping to take advantage of parallelism, and test multiple values at once? I don't see how that would help you, since you have to `// do something` after each comparison. When I compile your code with Visual C++ using the appropriate compiler options, it does the comparison with either `comiss` (SSE) or `vcomiss` (AVX). The comparison will not be the bottleneck here, it will be the conditional branching, and without knowing what "something" you are doing, there is no possible way to consider eliminating the branches. – Cody Gray - on strike Jun 24 '16 at 15:40
  • @Cody Actually the vectors `v1` and `v2` are in the form of `__m256` which are the results of other intrinsic operations. I can extract them and do the normal operation but I don't want to do that way. I need to work only with `__m256` – sharath chandra Jun 25 '16 at 06:23
  • That detail should have been mentioned in the question. That is exactly the kind of relevant fact that I'd been asking you for two days. – Cody Gray - on strike Jun 25 '16 at 11:46

3 Answers3

2

If you expect that v1[i] < v2[i] is almost never true, almost always true, or usually stays the same for a long run (even if overall there might be no particular bias), then an other technique is also applicable which offers "true conditionality" (ie not "do both, discard one result"), a price of course, but you also get to actually skip work instead of just ignoring some results.

That technique is fairly simple, do the comparison (vectorized), gather the comparison mask with _mm_movemask_ps, and then you have 3 cases:

  • All comparisons went the same way and they were all false, execute the appropriate "do something" code that is now maybe easier to vectorize since the condition is gone.
  • All comparisons went the same way and they were all true, same.
  • Mixed, use more complicated logic. Depending on what you need, you could check all bits separately (falling back to scalar code, but now just 1 FP compare for the whole lot), or use one of the "iterate only over (un)set bits" tricks (combines well with bitscan to recover the actual index), or sometimes you can fall back to doing masking and merging as usual.

Not all 3 cases are always relevant, usually you're applying this because the predicate almost always goes the same way, making one of the "all the same" cases so rare that you can just lump it in with "mixed".

This technique is definitely not always useful. The "mixed" case is complicated and slow. The fast-path has to be common and fast enough to be worth testing whether you're can take it.

But it can be useful, maybe one of the sides is very slow and annoying, while the other side of the branch is nice simple vectorizable code that doesn't take all that long in comparison. For example, maybe the slow side has to do argument reduction for an otherwise fast approximated transcendental function, or maybe it has to normalize some vectors before taking their dot product, or orthogonalize a matrix, maybe even get data from disk..

Or, maybe neither side is exactly slow, but they evict each others data from cache (maybe both sides are a loop over an array that fits in cache, but the arrays don't fit in it together) so doing them unconditionally slows both of them down. This is probably a real thing, but I haven't seen it in the wild (yet).

Or, maybe one side cannot be executed unconditionally, doing some generally destructive things, maybe even some IO. For example if you're checking for error conditions and logging them.

harold
  • 61,398
  • 6
  • 86
  • 164
1

SIMD conditional operations are done with branchless techniques. You use a packed-compare instruction to get a vector of elements that are all-zero or all-one.

e.g. you can conditionally add 4 to elements in an accumulator when a corresponding element matches a condition with code like:

__m128i match_counts = _mm_setzero_si128();

for (...) {
    __m128  fvec = something;
    __m128i  condition = _mm_castps_si128( _mm_cmplt_ps(fvec, _mm_setzero_ps()) );  // for elements less than zero
    __m128i masked_constant = _mm_and_si128(condition, _mm_set1_epi32(4));
    match_counts = _mm_add_epi32(match_counts, masked_constant);
}

Obviously this only works well if you can come up with a branchless way to do both sides of the branch. A blend instruction can often help.

It's likely that you won't get any speedup at all if there's too much work in each side of the branch, especially if your element size is 4 bytes or larger. (SIMD is really powerful when you're doing 16 operations in parallel on 16 separate bytes, less powerful when doing 4 operations on four 32-bit elements).

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Why AND with `1` when you can subtract with `-1` using `_mm_sub_epi32(match_counts, condition)` and save an instruction. That's how I do it. – Z boson Jun 26 '16 at 08:25
  • @Zboson: That's why I ANDed with `set1(4)`, so my example was still trivial but this optimization didn't directly apply. (Of course, using sub to count matches and then `_mm_psrli_epi32(match_counts, 2)` outside the loop would be better.) – Peter Cordes Jun 26 '16 at 12:01
1

I found a document which is very useful for conditional SIMD instructions. It is a perfect solution to my question. If...else condition

Document: http://saluc.engr.uconn.edu/refs/processors/intel/sse_sse2.pdf