SSE2 test xmm bitmask directly without using 'pmovmskb'

Question

Consider this code:

```
....
pxor            xmm1, xmm1
movdqu          xmm0, [reax]
pcmpeqb         xmm0, xmm1
pmovmskb        eax, xmm0
test            ax, ax
jz              .zero
...
```

Is there any way to avoid `pmovmskb` and test the bitmask directly in xmm0 (to check whether it is zero)? Is there an SSE instruction for this?

In fact, I'm looking for something like `ptest xmm0, xmm0`, but in SSE2, not SSE4.

Note that as `pcmpeqb` sets the fields to `0xff` on equality, you need `cmp ax, 0xffff` or `cmp eax, 0xffff` instead of `test ax, ax`. — fuz, Feb 28 '20 at 11:25
I've voted to close this question as “needs more clarity” because it is not clear whether you want to check if any byte in `xmm0` is zero or if all of them are zero. Please clarify this and I will retract my vote. — fuz, Feb 28 '20 at 11:44
Your code is checking if there are any non-zero bytes in `xmm0`, not whether `xmm0` is all zero. Can you clarify if that is what you want? Maybe also say what the context of that test is? (Is it really critical or are you micro-optimizing?) — chtz, Feb 28 '20 at 11:44
If you are indeed looking for something like the `ptest` instruction, then the sequence you have is already the best option (after you use `cmp eax, 0xffff` instead of `test ax, ax`). This has been asked a bunch of times before. Also, if you want others to notice that you changed something, write a comment with an @-mention of the person you want to respond to. — fuz, Feb 28 '20 at 13:25
If you want to check that 128 consecutive bits from memory are all zero, you are likely better off using something like `mov rdx, [rax]; or rdx, [rax+8]; je .zero;`. But as said above, you need to show more context of what you actually want to achieve ... — chtz, Feb 28 '20 at 14:02

Peter Cordes · Accepted Answer · 2020-02-28T13:38:18.290

It's generally not worth using SSE4.1 `ptest xmm0,xmm0` on a `pcmpeqb` result, especially not if you're branching.

`pmovmskb` is 1 uop, and `cmp` or `test` can macro-fuse with `jnz` into another single uop on both Intel and AMD CPUs. That's a total of 2 uops to branch on a `pcmpeqb` result with `pmovmskb` + `test`/`jcc`.

But `ptest` is 2 uops, and its 2nd uop can't macro-fuse with a following branch. That's a total of 3 uops to branch on a vector with `ptest` + `jcc`.

It's break-even when you can use `ptest` directly, without needing a `pcmp`, e.g. testing any / all bits in the whole vector (or, with a mask, some bits). And it's actually a win if you use it for `cmov` or `setcc` instead of a branch. It's also a win for code size, even though the number of uops is the same.
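For reference, here is a minimal sketch of the SSE2-only sequence (`pcmpeqb` + `pmovmskb` + `test`/`cmp`) written with C intrinsics rather than assembly. The function names `any_zero_byte` and `all_zero_bytes` are made up for illustration, and the compiler is free to pick slightly different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Hypothetical helper: true if at least one of the 16 bytes at p is zero. */
bool any_zero_byte(const void *p)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);     /* movdqu  */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* pcmpeqb */
    /* pmovmskb packs the top bit of each byte into the low 16 bits. */
    return _mm_movemask_epi8(eq) != 0;                    /* test + jcc */
}

/* Hypothetical helper: true if all 16 bytes at p are zero.
   Same sequence, but every byte must have compared equal,
   so the mask has to be exactly 0xffff. */
bool all_zero_bytes(const void *p)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xffff;               /* cmp + jcc */
}
```

The only difference between the two checks is the final comparison of the 16-bit mask, which is why the choice between `test eax,eax` and `cmp eax, 0xffff` matters in the assembly version.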

You can amortize the checking over multiple vectors, e.g. `por` some vectors together and then check that all of the bytes are zero, or `pminub` some vectors together and then check for any zeros. (glibc string functions like `strlen` and `strchr` use this trick to check a whole cache line of vectors in parallel, then sort out where the hit came from after leaving the loop.)
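A rough sketch of that amortization in C intrinsics, assuming a 64-byte block handled as four unaligned 16-byte loads. The names `any_zero_in_64` and `all_zero_in_64` are hypothetical, and this is not glibc's actual code, just the same idea in miniature.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Hypothetical helper: true if any of the 64 bytes at p is zero.
   pminub keeps the smaller unsigned byte per lane, so a lane of the
   combined vector is zero exactly when it is zero in some input. */
bool any_zero_in_64(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    __m128i m  = _mm_min_epu8(_mm_min_epu8(v0, v1),
                              _mm_min_epu8(v2, v3));      /* pminub x3 */
    __m128i eq = _mm_cmpeq_epi8(m, _mm_setzero_si128());  /* pcmpeqb   */
    return _mm_movemask_epi8(eq) != 0;                    /* one check */
}

/* Hypothetical helper: true if all 64 bytes at p are zero.
   por accumulates every set bit, so the result is all zero
   only when every input vector is all zero. */
bool all_zero_in_64(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    __m128i o  = _mm_or_si128(_mm_or_si128(v0, v1),
                              _mm_or_si128(v2, v3));      /* por x3    */
    __m128i eq = _mm_cmpeq_epi8(o, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xffff;
}
```

Either way, only one `pcmpeqb` + `pmovmskb` + branch is paid per 64 bytes; finding which vector and byte caused the hit is left for after the loop exits.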

You can also combine `pcmpeq` results instead of raw inputs, e.g. for `memchr`. In that case you can use `pand` instead of `pminub` to get a zero in an element where any input has a zero. Some CPUs run `pand` on more ports than `pminub`, so there's less competition for vector ALU ports.

Also note that `pmovmskb` zero-extends into EAX; you can `test eax,eax` instead of wasting a prefix byte to only test AX.

A question about `test eax,eax`: are you telling me that in long mode (64-bit) it's better to use `test rax,rax`? — ELHASKSERVERS, Feb 28 '20 at 16:11
@ELHASKSERVERS No. `test rax, rax` needs a prefix byte (as you already noticed in one of your previous questions) while `test eax, eax` does not. — fuz, Feb 28 '20 at 18:34
@ELHASKSERVERS: The default operand-size in long mode is 32-bit. Use that when you have a choice to save code size (i.e. when it doesn't cost any extra instructions), for the same reason you zero a 64-bit register with `xor eax,eax` — Peter Cordes, Feb 29 '20 at 20:28

score 0 · Answer 2 · answered Feb 28 '20 at 08:16

Use `ptest`:

```
ptest xmm0, xmm0
jz .zero
```

`ptest a, b` sets ZF if a ∧ b is zero and CF if ¬a ∧ b is zero.

Note, however, that `ptest` requires SSE4.1.

Otherwise, I suppose your approach is as good as it gets.
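To make the flag semantics concrete, here is a small SSE2-only model of the two flags in C intrinsics. It only illustrates what `ptest` computes (per Intel's documentation, CF is based on the NOT of the first operand ANDed with the second); it is not a faster replacement. The names `is_all_zero`, `ptest_zf` and `ptest_cf` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>

/* Hypothetical helper: true if every bit of v is zero, using only SSE2
   (pcmpeqb against zero, pmovmskb, compare the mask with 0xffff). */
static bool is_all_zero(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xffff;
}

/* Models ZF after `ptest a, b`: set when (a AND b) is all zero. */
bool ptest_zf(__m128i a, __m128i b)
{
    return is_all_zero(_mm_and_si128(a, b));      /* pand  */
}

/* Models CF after `ptest a, b`: set when ((NOT a) AND b) is all zero.
   _mm_andnot_si128(a, b) computes (~a) & b. */
bool ptest_cf(__m128i a, __m128i b)
{
    return is_all_zero(_mm_andnot_si128(a, b));   /* pandn */
}
```

With both operands equal, as in `ptest xmm0, xmm0`, ZF is set exactly when the register is all zero, which is the case the `jz .zero` branch relies on.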

fuz


Is there any other way in SSE2? – ELHASKSERVERS Feb 28 '20 at 10:50
@ELHASKSERVERS None that I know of that is better than what you already have. See the linked question for details. – fuz Feb 28 '20 at 11:23
@ELHASKSERVERS Also see [this answer](https://stackoverflow.com/q/27905677/417501). – fuz Feb 28 '20 at 11:26
@fuz The linked questions check for all-zeros. The question above checks for all bytes being non-zero. Using `ptest` (after `pcmpeqb`) seems to be the most efficient way for that (if available). – chtz Feb 28 '20 at 11:38
@chtz I suppose OP's code is wrong since the text below clearly states that he wants to check if the register is all zero. – fuz Feb 28 '20 at 11:40
