I'm trying to optimize some code with AVX2 assembly. At one point I need to compare the result of a computation against a threshold and write a byte of 0 or 255 to the output. I do the comparison with
VCMPPD ymm2, ymm1, ymm4, 1    ; ymm1 = values, ymm4 = thresholds, predicate 1 = less-than
After that, each of the 4 quadwords in ymm2 is either all zeros or all ones (0xFFFFFFFFFFFFFFFF). Ideally I would shrink these down to 4 bytes in EAX. At the moment, though, I'm doing 4 VPTEST operations and several conditional jumps to form the output, which slows things down significantly.
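To make the desired result concrete, here is a minimal scalar C sketch of the output I'm after. The function name `threshold_bytes` and the byte order (lane 0 in the low byte of the result) are assumptions for illustration only; this is a reference model of the result, not the vectorized code.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the desired output.
 * Lane i contributes byte i of the 32-bit result
 * (lane 0 in the low byte, which is an assumed ordering). */
uint32_t threshold_bytes(const double values[4], const double thresholds[4])
{
    uint32_t packed = 0;
    for (int lane = 0; lane < 4; ++lane) {
        /* Same test as VCMPPD with predicate 1 (less-than):
         * the result is false if either operand is NaN. */
        uint32_t byte = (values[lane] < thresholds[lane]) ? 0xFFu : 0x00u;
        packed |= byte << (8 * lane);
    }
    return packed;
}
```

So the value I want in EAX is these four bytes, produced without branching on each lane.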
Question: how can I extract and use the result of the comparison efficiently with AVX2?