
I would like to understand the potential gain of using Streaming Simd Extensions (SSE) for bitwise operations between integers in the following minimal example in C.

Suppose that one

  1. performs a bitwise operation (1) between two 64-bit unsigned long long ints a and b, e.g., a ^ b, and
  2. performs the same bitwise operation (2) between two 128-bit integers A and B with SSE, as in the sketch below.
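
For concreteness, here is a minimal sketch of the two operations (my own illustration, using the SSE2 intrinsic _mm_xor_si128; variable names follow the description above):

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2: __m128i, _mm_xor_si128 */

    /* (1): one 64-bit XOR between scalar integers. */
    uint64_t xor64(uint64_t a, uint64_t b) {
        return a ^ b;
    }

    /* (2): one 128-bit XOR between SSE registers (compiles to pxor). */
    __m128i xor128(__m128i A, __m128i B) {
        return _mm_xor_si128(A, B);
    }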

I would like to know whether executing (1) takes the same time as (2).

For example, one may try a timing experiment, measuring the time to perform N >> 1 bitwise operations of type (1) and the time to perform the same number of operations of type (2).
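
A rough sketch of such a timing loop for case (1) might look like the following (my own illustration; the empty asm statement is a GCC/Clang extension used only to keep the compiler from collapsing the repeated XORs, and acc forms a loop-carried dependency, so a loop like this really measures latency). Case (2) would repeat the same loop with __m128i variables and _mm_xor_si128:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N 1000000000ULL   /* number of repetitions, N >> 1 */

    int main(void) {
        struct timespec t0, t1;
        uint64_t acc = 1, b = 0x0123456789abcdefULL;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < N; i++) {
            acc ^= b;                            /* (1): one 64-bit XOR per iteration */
            __asm__ volatile("" : "+r"(acc));    /* GCC/Clang: keep the work from being folded away */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("scalar 64-bit XOR: %.3f s (acc = %llu)\n", sec, (unsigned long long)acc);
        /* ...then repeat the loop with __m128i variables and _mm_xor_si128 for (2)... */
        return 0;
    }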

Are these times roughly the same? If not, what would their ratio be on a specific machine? And how about the same question for 256-bit or wider vector extensions?

James

1 Answer


Are you asking about this as part of a compiled C function? Compilers can easily auto-vectorize loops over arrays with AVX2 vpxor or AVX1 vxorps, so how a ^ operator compiles depends on the surrounding context.
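
For example, a plain scalar loop like this (an illustrative sketch, not code from the question) typically auto-vectorizes; with gcc -O3 -mavx2 the body compiles to vpxor ymm:

    #include <stddef.h>
    #include <stdint.h>

    /* Plain scalar source; the compiler is free to turn the body into 128/256-bit XORs. */
    void xor_arrays(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] ^ b[i];
    }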

Obviously you have to compile with optimization enabled for any benchmark to be meaningful.


As far as what the hardware can do at an asm level, compiler-generated or hand-written doesn't matter; using intrinsics is a convenient way to get compilers to emit SIMD instructions.
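
For illustration, a hand-written AVX2 intrinsics version of the same array XOR could look like this (a sketch; it assumes n is a multiple of 4 and compilation with -mavx2, and uses unaligned loads/stores):

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>   /* AVX2: __m256i, _mm256_xor_si256 */

    /* 256-bit XOR over arrays, 4 x 64-bit elements per iteration. */
    void xor_arrays_avx2(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_xor_si256(va, vb));
        }
    }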

Let's take Intel Haswell as an example. With no memory bottlenecks, just operating on local variables in registers, with AVX2 you can get 3x vpxor ymm per clock (plus one other non-SIMD uop), so that's 3x 256 bits of XOR. (128-bit SSE2 pxor xmm has the same throughput as 256-bit AVX2 vpxor, on Intel CPUs, so wider vectors are pure win for throughput).

Or with purely scalar code you can do 4x scalar 8/16/32/64-bit xor per clock on Haswell, if you have no other instructions.

Both vpxor and xor are single uop, with 1 cycle latency.
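
As a sketch of why that distinction matters: one long dependency chain of XORs is bound by the 1-cycle latency, while several independent chains can approach the per-clock throughput (illustrative only; a real benchmark would also need an asm barrier or similar to stop the compiler from folding the repeated XORs):

    #include <emmintrin.h>

    /* One long dependency chain: bound by 1-cycle XOR latency, ~1 XOR per clock. */
    __m128i latency_bound(__m128i x, __m128i k, int n) {
        for (int i = 0; i < n; i++)
            x = _mm_xor_si128(x, k);       /* each XOR waits for the previous result */
        return x;
    }

    /* Three independent chains: can approach the 3-per-clock vector-ALU throughput on Haswell. */
    __m128i throughput_bound(__m128i a, __m128i b, __m128i c, __m128i k, int n) {
        for (int i = 0; i < n; i++) {
            a = _mm_xor_si128(a, k);
            b = _mm_xor_si128(b, k);
            c = _mm_xor_si128(c, k);
        }
        return _mm_xor_si128(_mm_xor_si128(a, b), c);
    }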

On AMD Bulldozer-family and earlier, pxor / vpxor has 2 cycle latency, but 2 per clock throughput, so the perf difference between a latency bottleneck vs. a throughput bottleneck is a factor of 4.

CPU performance on such small scales is not one-dimensional. Superscalar pipelined out-of-order CPUs make the question you're asking of "does one take longer" too simplistic. See my answer on What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, specifically the "There are three major dimensions to analyze for a short block" section.

See https://agner.org/optimize/ and other performance links in the x86 tag wiki.

Peter Cordes
  • So the AVX version gets more done per clock. It's the same story over on PowerPC's Altivec (may it rest in obscure peace...) and Cell's SPEs (basically the same thing). Presumably there may be some data alignment requirements for AVX, which if not met would slow things down significantly in comparison to scalar code? – bazza Oct 08 '18 at 20:07
  • @Peter Cordes I understand that compilers can auto-vectorize loops. However, I mentioned the loop only to measure execution times by repeating many times the same operation. If one could measure the time of a single XOR operation, my question would have been: does a single 64-bit XOR take the same time as a single 128-bit XOR with SSE? I revised my question to clarify this point. – James Oct 08 '18 at 20:19
  • @James: both are single-uop, 1c latency on Intel CPUs, and Ryzen. But on Haswell and later, scalar xor has 4 per clock throughput instead of 3 for vector uops. The ALU on port 6 (new in Haswell) only supports scalar instructions (including taken branches so it's nice for throughput in small loops). CPU performance on such small scales is not one-dimensional – Peter Cordes Oct 08 '18 at 20:36