It's generally not worth using SSE4.1 ptest xmm0,xmm0 on a pcmpeqb result, especially not if you're branching.
pmovmskb is 1 uop, and cmp or test can macro-fuse with jnz into another single uop on both Intel and AMD CPUs. That's a total of 2 uops to branch on a pcmpeqb result with pmovmskb + test/jcc.
But ptest is 2 uops, and its 2nd uop can't macro-fuse with a following branch. That's a total of 3 uops to branch on a vector with ptest + jcc.
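As a rough sketch of the 2-uop route, here is the pcmpeqb / pmovmskb / test sequence written with SSE2 intrinsics. The function name any_byte_equals is hypothetical, and the exact instructions emitted depend on the compiler and flags.

```c
#include <emmintrin.h>  /* SSE2: pcmpeqb, pmovmskb */

/* Hypothetical helper: nonzero if any of the 16 bytes at p equals c. */
static int any_byte_equals(const unsigned char *p, unsigned char c)
{
    __m128i v    = _mm_loadu_si128((const __m128i *)p);        /* movdqu   */
    __m128i eq   = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)c));  /* pcmpeqb  */
    int     mask = _mm_movemask_epi8(eq);                      /* pmovmskb */
    /* When the caller branches on this, a compiler can typically use
       test + jcc on the mask, which is the macro-fusable pair. */
    return mask != 0;
}
```

The mask also tells you which byte matched (via a bit-scan), which ptest cannot do, since ptest only sets flags.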
It's break-even when you can use ptest directly, without needing a pcmp, e.g. testing any / all bits in the whole vector (or with a mask, some bits). And it's actually a win if you use it for cmov or setcc instead of a branch. It's also a win for code-size, even though it's the same number of uops.
You can amortize the checking over multiple vectors, e.g. por some vectors together and then check that all of the bytes are zero. Or pminub some vectors together and then check for any zeros. (glibc string functions like strlen and strchr use this trick to check a whole cache-line of vectors in parallel, before sorting out where it came from after leaving the loop.)
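A minimal sketch of the pminub trick, assuming a 64-byte block and unaligned loads. The name has_zero_in_64 is made up for illustration, and this is not glibc's actual code.

```c
#include <emmintrin.h>  /* SSE2: pminub, pcmpeqb, pmovmskb */

/* Hypothetical helper: nonzero if any of the 64 bytes at p is zero. */
static int has_zero_in_64(const unsigned char *p)
{
    __m128i a = _mm_loadu_si128((const __m128i *)(p +  0));
    __m128i b = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i c = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i d = _mm_loadu_si128((const __m128i *)(p + 48));

    /* Unsigned byte minimum: a lane of the result is 0 if that lane
       is 0 in any of the four inputs (pminub). */
    __m128i m = _mm_min_epu8(_mm_min_epu8(a, b), _mm_min_epu8(c, d));

    /* One compare and one pmovmskb cover all 64 bytes. */
    __m128i z = _mm_cmpeq_epi8(m, _mm_setzero_si128());
    return _mm_movemask_epi8(z) != 0;
}
```

This only answers whether a zero exists somewhere in the block. Finding which vector and which byte it was in is left for after the loop exits, as described above.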
You can combine pcmpeq results instead of raw inputs, e.g. for memchr. In that case you can use pand instead of pminub to get a zero in an element where any input has a zero. Some CPUs run pand on more ports than pminub, so less competition for vector ALU.
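To illustrate pand on compare results, here is a sketch that uses a memcmp-style equality check of two 64-byte blocks rather than memchr. That use case and the name blocks_equal_64 are assumptions chosen for the example. Each pcmpeqb result lane is 0x00 or 0xFF, so pand gives the same answer pminub would.

```c
#include <emmintrin.h>  /* SSE2: pcmpeqb, pand, pmovmskb */

/* Hypothetical helper: nonzero if the 64 bytes at x equal the 64 bytes at y. */
static int blocks_equal_64(const unsigned char *x, const unsigned char *y)
{
    __m128i acc = _mm_set1_epi8((char)0xFF);   /* start with all lanes "equal" */
    for (int i = 0; i < 64; i += 16) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i eq = _mm_cmpeq_epi8(vx, vy);   /* 0xFF where equal, 0x00 where not */
        acc = _mm_and_si128(acc, eq);          /* pand: a zero lane in any eq stays zero */
    }
    /* All 16 mask bits set means no lane ever saw a mismatch. */
    return _mm_movemask_epi8(acc) == 0xFFFF;
}
```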
Also note that pmovmskb zero-extends into EAX; you can test eax,eax instead of wasting a prefix byte to only test AX.