1

As you know, the first two are AVX-specific intrinsics and the last is an SSE4.1 intrinsic. Both sets of intrinsics can be used to check two floating-point vectors for equality. My specific use case is:

  • _mm_cmpeq_ps or _mm_cmpeq_pd, followed by
  • _mm_testc_ps or _mm_testc_pd on the result, with an appropriate mask

But AVX provides equivalents for "legacy" SSE intrinsics, so I might be able to use _mm_testc_si128 after casting the result to __m128i. My questions are: which of the two approaches gives better performance, and where can I find out which legacy SSE instructions have AVX equivalents?

Paul R
user1095108
  • Apologies if you're already aware of this, but you do know that, in general, comparing floats for equality is usually a bad idea, right? – Paul R Mar 04 '16 at 08:24
  • Also note that `_mm_cmpeq_ps`/`_mm_cmpeq_pd` are SSE2, not AVX, so I don't see any AVX-specific aspect to your question? – Paul R Mar 04 '16 at 08:30
  • I know comparing floats for equality is usually bad, but not always. – user1095108 Mar 04 '16 at 08:38

1 Answer

5

Oops, I didn't read the question carefully. You're talking about using these after a cmpeqps. They're always slower than movmskps / test if you already have a mask. cmpps / ptest / jcc is 4 uops. cmpps / movmskps eax, xmm0 / test eax,eax / jnz is 3 uops. (test/jnz fuse into a single uop). Also, none of the instructions are multi-uop, so no decode bottlenecks.
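As a compilable sketch of the cheaper movmskps-based sequence (the function name is just illustrative):

```c
#include <immintrin.h>
#include <stdbool.h>

// Compiles to cmpps / movmskps / cmp+jcc -- no ptest needed.
// Each lane of `eq` is all-ones where a == b, so the 4-bit
// movemask result is 0xF iff all four lanes compared equal.
static bool all_equal_movemask(__m128 a, __m128 b) {
    __m128 eq = _mm_cmpeq_ps(a, b);
    return _mm_movemask_ps(eq) == 0xF;
}
```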

Only use ptest / vtestps/pd when you can take full advantage of the AND or ANDN operation to avoid an earlier step. I've posted answers before where I compared ptest vs. an alternative. I think I did find one case once where ptest was a win, but it's hard to use. Yup, found it: someone wanted an FP compare that was true for NaN == NaN. It's one of the only times I've ever found a use for the carry flag result of ptest.

If the high element of a compare result is "garbage", then you can still ignore it cheaply with movmskps:

(_mm_movemask_ps(vec) & 0b0111) == 0  // tests for none of the first three being true

This is totally free. The x86 test instruction works a lot like ptest: you can use it with an immediate mask instead of just testing a register against itself. (It actually has a tiny cost: one extra byte of machine code, because test eax, 3 is one byte longer than test eax, eax, but they run identically.)
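As a compilable version of that trick (helper names are illustrative; binary literals are a GCC/Clang extension, standardized in C23):

```c
#include <immintrin.h>
#include <stdbool.h>

// The & folds into an x86 `test reg, imm` instruction, so ignoring
// the high lane of the 4-bit movemask result costs no extra uops.
static bool none_of_first3(__m128 cmp_result) {
    return (_mm_movemask_ps(cmp_result) & 0b0111) == 0;
}

static bool all_of_first3(__m128 cmp_result) {
    return (_mm_movemask_ps(cmp_result) & 0b0111) == 0b0111;
}
```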

See the x86 tag wiki for links to guides (Agner Fog's guide is good for perf analysis at the instruction level). There's an AVX version of every legacy SSE instruction, but some are only 128 bits wide. They all get an extra operand (so the dest doesn't have to be one of the src regs), which saves on mov instructions to copy registers.


Answer to a question you didn't ask:

Neither _mm_testc_ps nor _mm_testc_si128 can be used to compare floats for equality. vtestps is like ptest, but only operates on the sign bits of each float element.

They all compute (~x) & y (on sign bits or on the full register), which doesn't tell you whether they're equal, or even whether the sign bits are equal.

Note that even checking for bitwise equality of floats (with pcmpeqd) isn't the same as cmpeqps (which implements C's == operator), because -0.0 isn't bitwise equal to 0.0. And two bitwise-identical NaNs aren't equal to each other. The comparison is unordered (which means not equal) if either operand (or both) is NaN.
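These corner cases can be checked directly. A sketch with illustrative helper names, assuming SSE2:

```c
#include <immintrin.h>
#include <math.h>      // NAN
#include <stdbool.h>

// C's == semantics on all four lanes (cmpeqps)
static bool fp_equal_all(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}

// Bitwise equality of all four 32-bit lanes (pcmpeqd)
static bool bitwise_equal_all(__m128 a, __m128 b) {
    __m128i eq = _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b));
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

With these, `fp_equal_all` says -0.0 equals 0.0 but NaN doesn't equal itself, while `bitwise_equal_all` gives the opposite answers for both cases.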

Peter Cordes
  • I see, sometimes I need to make use of the mask, for example, when I only use a part of the register (say 3 out of 4, or 2 out of 4). This is the use case where the mask becomes useful, so that I don't need to do an additional AND on the `movmskps` result. – user1095108 Mar 04 '16 at 08:42
  • 1
    @user1095108: you can mask the result of `movmskps`. e.g. `(_mm_movemask_ps(vec) & 0b0111) == 0b0111` tests for all of the first three elements being true, without caring about the 4th. `(_mm_movemask_ps(vec) & 0b0111) == 0` tests for none of the first three being true, and is totally free. (The x86 `test` instruction works a lot like `ptest`: you can use it with an immediate mask instead of just testing a register against itself.) – Peter Cordes Mar 04 '16 at 08:44
  • How about `_mm_movemask_epi8` after the compare? Is that better than `_mm_movemask_ps`? – user1095108 Mar 04 '16 at 08:48
  • 1
    @user1095108: no, it has no new information, and there's possibly a bypass delay for sending float data to an integer instruction on some CPUs. Using it just means you need to repeat each bit 4 or 8 times in your mask, which could increase code size (`test eax, 3` can use an imm8 for the `3`, but `test eax, 0b0000111111111111` needs to use the `test eax, imm32` encoding: 5 bytes total). So it's larger code size with no advantage, and possible disadvantage. Generally only use `movemask_epi8` after packed integer compares, esp. if the element size was smaller than 4B. – Peter Cordes Mar 04 '16 at 08:53
  • 1
    Or of course for bithack stuff where you can do something with the high bit of every byte, even when it's not a comparison result. – Peter Cordes Mar 04 '16 at 08:55
  • Ok, final question, where can I gain info about the uops (microops?) you mention? I'd like to do my own uop counts. – user1095108 Mar 04 '16 at 08:56
  • 1
    I already linked the x86 tag wiki in my answer :P To really understand when latency vs. throughput vs. total uops matters, and when there are other delays you have to account for, you need to read and understand Agner Fog's microarchitecture guide for the CPU you're tuning for. Fortunately SnB-family CPUs (with their uop cache and loop buffer) make uop counting fairly directly useful. On other uarches, decode is more often a bottleneck than the pipeline issue width. But the frontend matters, esp. when pushing over 3 uops per clock. Then it's trial and error with insn ordering/alignment. – Peter Cordes Mar 04 '16 at 09:00