
Consider that we have this:

....
pxor            xmm1, xmm1
movdqu          xmm0, [reax]
pcmpeqb         xmm0, xmm1
pmovmskb        eax,  xmm0
test            ax , ax
jz              .zero
...

Is there any way to avoid 'pmovmskb' and test the bitmask directly in xmm0 (to check whether it's zero)? Is there an SSE instruction for this?

In fact, I'm looking for something that behaves like 'ptest xmm0, xmm0', but in SSE2, not SSE4.

ELHASKSERVERS
  • Note that as `pcmpeqb` sets the fields to `0xff` on equality, you need `cmp ax, 0xffff` or `cmp eax, 0xffff` instead of `test ax, ax`. – fuz Feb 28 '20 at 11:25
  • I've voted to close this question as “needs more clarity” because it is not clear if you want to check if any byte in `xmm0` is zero or if all of them are zero. Please clarify this and I will retract my vote. – fuz Feb 28 '20 at 11:44
  • Your code is checking if there are any non-zero bytes in `xmm0`, not whether `xmm0` is all zero. Can you clarify if that is what you want? Maybe also say what the context of that test is? (Is it really critical or are you micro-optimizing?) – chtz Feb 28 '20 at 11:44
  • If you are indeed looking for something like the `ptest` instruction, then the sequence you have is already the best option (after you use `cmp eax, 0xffff` instead of `test ax, ax`). This has been asked a bunch of times before. Also, if you want others to notice that you changed something, write a comment with an @-mention of the person you want to respond to. – fuz Feb 28 '20 at 13:25
  • If you want to check that 128 consecutive bits from memory are all zero, you are likely better using something like `mov rdx, [rax]; or rdx, [rax+8]; je .zero;`. But as said above you need to show more context of what you actually want to achieve ... – chtz Feb 28 '20 at 14:02
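The scalar alternative from the last comment (`mov rdx, [rax]; or rdx, [rax+8]`) can be sketched in C as follows. This is only an illustration: the function name `all_zero_scalar` is hypothetical, and `memcpy` is used for the two 8-byte loads so that unaligned pointers are handled safely.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the 16 bytes at p are all zero. */
int all_zero_scalar(const void *p)
{
    uint64_t lo, hi;
    memcpy(&lo, p, 8);                             /* mov rdx, [rax]    */
    memcpy(&hi, (const unsigned char *)p + 8, 8);  /* second qword      */
    return (lo | hi) == 0;                         /* or + branch on ZF */
}
```

Compilers typically turn each `memcpy` into a single 64-bit load, so this should come close to the two-instruction sequence in the comment.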

2 Answers


It's generally not worth using SSE4.1 ptest xmm0,xmm0 on a pcmpeqb result, especially not if you're branching.

pmovmskb is 1 uop, and cmp or test can macro-fuse with jnz into another single uop on both Intel and AMD CPUs. Total of 2 uops to branch on a pcmpeqb result with pmovmskb + test/jcc.

But ptest is 2 uops, and its 2nd uop can't macro-fuse with a following branch. Total of 3 uops to branch on a vector with ptest + jcc.
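As a minimal sketch of the SSE2-only pmovmskb route, here it is written with C intrinsics from `<emmintrin.h>` instead of assembly so it can be compiled directly. The function names are hypothetical; each intrinsic is annotated with the instruction it normally maps to. Note the two different final tests: "all bytes zero" needs a compare against 0xffff (as pointed out in the comments on the question), while "any byte zero" is a plain test for a non-zero mask.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 if all 16 bytes at p are zero. */
int all_bytes_zero_sse2(const void *p)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);     /* movdqu         */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* pxor + pcmpeqb */
    int mask   = _mm_movemask_epi8(eq);                   /* pmovmskb       */
    return mask == 0xFFFF;                                /* cmp eax,0xffff */
}

/* Hypothetical helper: 1 if at least one of the 16 bytes at p is zero. */
int any_byte_zero_sse2(const void *p)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0;                    /* test eax,eax   */
}
```

When the result feeds an `if`, a compiler will usually emit the compare or test directly in front of the conditional jump, which is the macro-fusible pair counted above.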


It's break-even when you can use ptest directly, without needing a pcmp, e.g. testing any / all bits in the whole vector (or with a mask, some bits). And actually a win if you use it for cmov or setcc instead of a branch. It's also a win for code-size, even though it's the same number of uops.


You can amortize the checking over multiple vectors. E.g. por some vectors together and then check that all of the bytes are zero. Or pminub some vectors together and then check for any zeros. (glibc string functions like strlen and strchr use this trick to check a whole cache-line of vectors in parallel, before sorting out where it came from after leaving the loop.)
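A rough sketch of that amortization for one 64-byte block, using SSE2 intrinsics in C. The function names and the fixed block size of four vectors are assumptions made for illustration, not glibc's actual code. Only one pcmpeqb + pmovmskb is paid per block instead of one per vector.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 if all 64 bytes at p are zero. */
int block64_all_zero(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    /* por: a bit is set in acc if it is set in any input vector */
    __m128i acc = _mm_or_si128(_mm_or_si128(v0, v1), _mm_or_si128(v2, v3));
    __m128i eq  = _mm_cmpeq_epi8(acc, _mm_setzero_si128());  /* pcmpeqb  */
    return _mm_movemask_epi8(eq) == 0xFFFF;                  /* pmovmskb */
}

/* Hypothetical helper: 1 if any of the 64 bytes at p is zero. */
int block64_has_zero_byte(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));
    /* pminub: a byte of m is zero if that byte is zero in any input vector */
    __m128i m  = _mm_min_epu8(_mm_min_epu8(v0, v1), _mm_min_epu8(v2, v3));
    __m128i eq = _mm_cmpeq_epi8(m, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0;
}
```

The combined result only says that a hit exists somewhere in the block; locating the exact byte still requires looking at the individual vectors afterwards.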

You can combine pcmpeq results instead of raw inputs, e.g. for memchr. In that case you can use pand instead of pminub to get a zero in an element where any input has a zero. Some CPUs run pand on more ports than pminub, so less competition for vector ALU.
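To illustrate why pand can stand in for pminub here: compare results contain only 0x00 or 0xFF per byte, and on such values a bitwise AND and an unsigned byte minimum give the same answer. The sketch below (hypothetical function names, C with SSE2 intrinsics) combines two pcmpeqb results both ways and returns the pmovmskb mask, so the two can be compared.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: combine two compare results with pand. */
unsigned combine_eq_pand(const void *a, const void *b, char c)
{
    __m128i key = _mm_set1_epi8(c);
    __m128i ra  = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)a), key);
    __m128i rb  = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)b), key);
    /* pand: a byte is 0x00 if it is 0x00 in either compare result */
    return (unsigned)_mm_movemask_epi8(_mm_and_si128(ra, rb));
}

/* Hypothetical helper: same combination, but with pminub. */
unsigned combine_eq_pminub(const void *a, const void *b, char c)
{
    __m128i key = _mm_set1_epi8(c);
    __m128i ra  = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)a), key);
    __m128i rb  = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)b), key);
    /* pminub: min(0x00, x) = 0x00 and min(0xFF, 0xFF) = 0xFF */
    return (unsigned)_mm_movemask_epi8(_mm_min_epu8(ra, rb));
}
```

The equivalence holds only for inputs that are already all-zeros or all-ones per byte; on raw data, pand and pminub compute different things.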


Also note that pmovmskb zero-extends into EAX; you can test eax,eax instead of wasting a prefix byte to only test AX.

Peter Cordes
  • A question about 'test eax,eax': are you telling me that in long mode (64-bit) it's better to use 'test rax,rax'? – ELHASKSERVERS Feb 28 '20 at 16:11
  • @ELHASKSERVERS No. `test rax, rax` needs a prefix byte (as you already noticed in one of your previous questions) while `test eax, eax` does not. – fuz Feb 28 '20 at 18:34
  • @ELHASKSERVERS: The default operand-size in long mode is 32-bit. Use that when you have a choice to save code size (i.e. when it doesn't cost any extra instructions), for the same reason you zero a 64-bit register with `xor eax,eax`. – Peter Cordes Feb 29 '20 at 20:28

Use ptest:

ptest xmm0, xmm0
jz .zero

ptest a, b sets ZF if a ∧ b is zero and CF if ¬a ∧ b is zero.

Note however that SSE 4.1 is required for ptest to be present.

Otherwise, I suppose your approach is as good as it gets.
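For reference, the flag behaviour of ptest can be modelled in plain scalar C. This is only a model, not the instruction itself: the `vec128` struct is a hypothetical stand-in for an XMM register, and `a` is the first (destination) operand as in `ptest a, b`. Following the Intel manual's definition, ZF is set when a AND b is all zero, and CF is set when (NOT a) AND b is all zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit XMM register. */
typedef struct { uint64_t lo, hi; } vec128;

/* ZF after `ptest a, b`: 1 when (a AND b) has no bits set. */
int ptest_zf(vec128 a, vec128 b)
{
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF after `ptest a, b`: 1 when ((NOT a) AND b) has no bits set. */
int ptest_cf(vec128 a, vec128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}
```

With both operands equal, as in `ptest xmm0, xmm0`, ZF simply reports whether the register is all zero, which is why `jz .zero` works in the snippet above.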

fuz
  • Is there any other way in SSE2? – ELHASKSERVERS Feb 28 '20 at 10:50
  • @ELHASKSERVERS None that I know of that is better than what you already have. See the linked question for details. – fuz Feb 28 '20 at 11:23
  • @ELHASKSERVERS Also see [this answer](https://stackoverflow.com/q/27905677/417501). – fuz Feb 28 '20 at 11:26
  • @fuz The linked questions check for all-zeros. The question above checks for all bytes being non-zero. Using `ptest` (after `pcmpeqb`) seems to be the most efficient way for that (if available). – chtz Feb 28 '20 at 11:38
  • @chtz I suppose OP's code is wrong since the text below clearly states that he wants to check if the register is all zero. – fuz Feb 28 '20 at 11:40