The `orpd` instruction is a "bitwise logical OR of packed double precision floating point values". Doesn't this do exactly the same thing as `por` ("bitwise logical OR")? If so, what's the point of having it?

1 Answer
Remember that SSE1 `orps` came first. (Well, actually MMX `por mm, mm/mem` came even before SSE1.) Having the same opcode with a new prefix be the SSE2 `orpd` instruction makes sense for hardware decoder logic, I guess, just like `movapd` vs. `movaps`. Several instructions like this are redundant between `ps` and `pd` versions, but some aren't, like `addps` vs. `addpd`, or `unpcklps` vs. `unpcklpd` being different shuffles.
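A quick look at the machine-code bytes makes the pattern concrete (register-register forms, per the Intel instruction-set reference; the `66` operand-size prefix is the only difference within each `ps`/`pd` pair):

```asm
0F 28 /r     movaps xmm, xmm/mem    ; SSE1
66 0F 28 /r  movapd xmm, xmm/mem    ; SSE2: same opcode, new 66 prefix
0F 56 /r     orps   xmm, xmm/mem    ; SSE1
66 0F 56 /r  orpd   xmm, xmm/mem    ; SSE2: same opcode, new 66 prefix
```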
The reason for SSE2 also introducing `66 0F EB /r por xmm, xmm/mem` is at least partly consistency with MMX `0F EB /r por mm, mm/mem`: again, the same opcode with a new mandatory prefix. Just like `paddb mm, mm` vs. `paddb xmm, xmm`.
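Spelled out in bytes, the MMX-to-SSE2 prefix trick looks the same as the `ps`/`pd` one:

```asm
0F EB /r     por   mm, mm/mem      ; MMX
66 0F EB /r  por   xmm, xmm/mem    ; SSE2: same opcode, mandatory 66 prefix
0F FC /r     paddb mm, mm/mem      ; MMX
66 0F FC /r  paddb xmm, xmm/mem    ; SSE2
```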
But also for the possibility of different bypass-forwarding domains for vec-integer vs. FP. Different microarchitectures have had different behaviours for how they actually decoded and ran those different instructions. Some ran all the XMM `or` instructions the same way, creating extra latency for forwarding between FP and SIMD-integer domains.
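As a sketch of the kind of penalty this can cause (the exact cost varies by microarchitecture; Agner Fog measured on the order of 1-2 extra cycles per domain crossing on some Intel uarches like Nehalem):

```asm
; dependency chain that crosses domains on some CPUs:
addps  xmm0, xmm1     ; FP domain
por    xmm0, xmm2     ; integer-domain boolean consuming an FP result:
                      ; possible bypass-delay cycle(s) here
addps  xmm0, xmm3     ; back to FP: possibly another crossing

; vs. keeping the whole chain in the FP domain:
addps  xmm0, xmm1
orps   xmm0, xmm2     ; FP-domain boolean on CPUs that run orps there
addps  xmm0, xmm3
```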
No CPUs have ever actually had different forwarding domains for FP-float vs. FP-double, so yes, `movapd` and `orpd` are in practice useless wastes of space that you should never use. Use the smaller `orps` encoding instead.
(Or with VEX encoding it doesn't matter; `vorps` and `vorpd` are the same size: 2-byte prefix + opcode + modrm ...)
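For example, hand-assembling the register-register forms (register choices are arbitrary, just for illustration):

```asm
0F 56 C1        orps  xmm0, xmm1          ; 3 bytes
66 0F 56 C1     orpd  xmm0, xmm1          ; 4 bytes: wasted 66 prefix
66 0F EB C1     por   xmm0, xmm1          ; 4 bytes
C5 F8 56 C1     vorps xmm0, xmm0, xmm1    ; 4 bytes (2-byte VEX prefix)
C5 F9 56 C1     vorpd xmm0, xmm0, xmm1    ; 4 bytes: same size with VEX
```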
**`por` vs. `orps`**
For more about bypass delay when using `por` between FP math instructions like `addps`, or `orps` between SIMD-integer insns like `paddb`, see:
- Do I get a performance penalty when mixing SSE integer/float SIMD instructions
- What's the difference between logical SSE intrinsics?
- Difference between the AVX instructions vxorpd and vpxor
- Does using mix of pxor and xorps affect performance?
- Is there any situation where using MOVDQU and MOVUPD is better than MOVUPS?
- Choosing SSE instruction execution domains in mixed contexts - pre-Skylake, integer versions have better throughput.
And in case anyone was wondering, the answer to the other interpretation of the title: bitwise booleans on FP values are mostly used to set, clear, or toggle the sign bit, or to do stuff with `cmpps/pd` masks like blending.
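For instance (a minimal NASM-style sketch; the mask constants and register allocation here are mine, for illustration only):

```asm
section .rodata
align 16
signbits: times 4 dd 0x80000000    ; just the sign bit of each float
absmask:  times 4 dd 0x7FFFFFFF    ; everything but the sign bit

section .text
; toggle sign: x = -x
xorps   xmm0, [rel signbits]
; clear sign: x = fabs(x)
andps   xmm0, [rel absmask]

; branchless select with a compare mask, pre-SSE4.1 blendvps:
; result = (a < b) ? x : y   with a=xmm2, b=xmm3, x=xmm0, y=xmm1
cmpltps xmm2, xmm3            ; xmm2 = all-ones where a < b, else zero
andps   xmm0, xmm2            ; keep x where the mask is set
andnps  xmm2, xmm1            ; keep y where the mask is clear
orps    xmm0, xmm2            ; combine the two halves
```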

- Is there any effect on performance if switching between integer and floating point instructions? – rcgldr May 31 '20 at 05:56
- @rcgldr: yes, [What's the difference between logical SSE intrinsics?](https://stackoverflow.com/a/31233017) – Peter Cordes May 31 '20 at 06:01
- If I recall correctly, the first implementation of the AMD Opteron incurred a delay when switching between SIMD data types (even of the same width). I think that this applied to float vs double as well as FP vs int, but it has been about 15 years.... – John D McCalpin Jun 04 '20 at 16:21
- @JohnDMcCalpin: Agner's uarch guide says float in general is one domain on K8/K10. Maybe you're thinking of Bulldozer-family, where FMA/math instructions of different width have a reformatting delay, like `addps` -> `addpd`. That was only for actual math, not booleans or shuffles. `addps` could forward to `unpcklpd` without delay, so it's not evidence of 3 separate bypass networks, or any justification for the existence of `orpd`. It is evidence that BD kept extra stuff with vector elements to be reused, and also had special-case lower latency when forwarding FMA->FMA than `orps` -> FMA. – Peter Cordes Jun 04 '20 at 16:46
- @JohnDMcCalpin: TL:DR: Definitely was a thing on Bulldozer-family (and Jaguar/Bobcat), not K8/K10, but not for `orpd` / `orps`. Only actual math. Assuming Agner Fog's summary is fully accurate. For K8/K10 I did consider a reformatting delay as a hypothesis, but rejected it based on only seeing extra latency between i-vec and FP, not between single- and double-precision instructions on that uarch family. – Peter Cordes Jun 04 '20 at 16:50