6

Is it possible to compare more than a pair of numbers in one instruction using SSE4?

Intel Reference says the following about PCMPGTQ

PCMPGTQ — Compare Packed Data for Greater Than

Performs an SIMD compare for the packed quadwords in the destination operand (first operand) and the source operand (second operand). If the data element in the first (destination) operand is greater than the corresponding element in the second (source) operand, the corresponding data element in the destination is set to all 1s; otherwise, it is set to 0s.

which is not really what I want because I want to be able to decide which integers are greater and which are smaller in the vector.

For example, if I need to compare

32 with 45
13 with 78
44 with 12
99 with 66

I was planning to put [32, 13, 44, 99] in one vector and [45, 78, 12, 66] in another vector and compare them using SSE4 in one instruction, and have [0, 0, 1, 1] as result (0 - less, 1 - greater)

But it seems this is not what PCMPGTQ does. Any suggestions on how to use parallelism at this level to speedup this comparison?

Lazer
  • 90,700
  • 113
  • 281
  • 364

1 Answers1

5

I believe that is actually what the PCMPGT family of operators does. The suffix specifies the size of the elements - B for 8-bit elements, W for 16-bit elements, D for 32-bit elements, Q for 64-bit elements. So, if you want to compare 4 32-bit numbers at once, use PCMPGTD with 128-bit vector arguments. See this page for a pseudocode description of these opcodes.

They don't write just 1 or 0, though; they write all-ones or all-zeroes to each element, so that comparing 0x1234567887654321 against 0x8765432112345678 using PCMPGTB should give 0x0000FFFFFFFF0000.

This Intel white paper gives a neat example of performing the operation a[i] = (a[i] > b[i]) ? a[i] : b[i] (i.e. a[i] = max(a[i], b[i])) using vector operations.

nneonneo
  • 171,345
  • 36
  • 312
  • 383
  • Thanks, makes some sense. But looks like this speedup doesnt give me much. I was able to compare 4 at a time, but how do I use the results? I will probably need to do something like `if ((int)result[0] == 0)` and thus will incur another comparison. I am definitely missing something. – Lazer Sep 24 '12 at 04:21
  • @Lazer It depends on the surrounding code. If you're packing data just for a single comparison, then there's no way you'll get any speedup. You need to do a LOT of work while the data is packed before you will get any worthwhile speedup. – Mysticial Sep 24 '12 at 04:24
  • Well, of course it depends on what you want to do with it! There's a lot of neat tricks you can do with `xor` and `and` against this sort of all-ones vector. – nneonneo Sep 24 '12 at 04:24
  • Thanks nneonneo and @Mystical, I think I understand. I just need to figure a way to use these results in an optimal way. – Lazer Sep 24 '12 at 04:27
  • @Lazer There's a reason why they have it return all 1s. Then you can use it as a mask to do cool stuff like vector conditional moves and such. – Mysticial Sep 24 '12 at 04:34
  • Yeah, see the link I (just) posted. They do basically `C=A; C=A>C; A^=B; C&=A; C^=B` to do `C = max(A,B)` with vectors. – nneonneo Sep 24 '12 at 04:38
  • Both links are now broken. Nobody cares about link stability anymore? Intel, come on! [New URL for intel while paper (PDF)](http://www.intel.in/content/dam/www/public/us/en/documents/white-papers/ia-32-64-assembly-lang-paper.pdf) – bluss Jul 03 '15 at 18:09
  • @bluss: thanks! I've corrected the link in my answer. – nneonneo Jul 03 '15 at 18:26