I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.

By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.

eoinmullan
  • You're probably memory bound. 33 million shorts isn't going to fit in the CPU cache. For that matter, you'll need it to fit in L1 if you want to see the difference that you're expecting. – Mysticial Feb 26 '16 at 23:36
  • Do you have any way to try running the same AVX code on your AVX2-capable processor? – Ben Voigt Feb 26 '16 at 23:40
  • Also, FWIW, your SIMD implementation stops one block too early. – Ben Voigt Feb 26 '16 at 23:41
  • @Mysticial Can you explain further, please? I don't understand how that would explain why AVX performs better than AVX2? I don't expect to fit the entire arrays into L1. I'm processing them sequentially (predictably) so I expect they would be pre-fetched into L1 as required. – eoinmullan Feb 26 '16 at 23:46
  • 3
    @eoinmullan You seem to be testing things on different machines. Saying that you get 2x speedup on Ivy Bridge with AVX doesn't mean you'll get more than 2x on Haswell with AVX2. This is definitely the case if the machines have different amounts of memory bandwidth. You need to do what Ben said. Run all the tests on the same machine. Otherwise you're comparing apples to oranges. – Mysticial Feb 26 '16 at 23:49
  • 1
    What is your memory bus width? How many banks? Are they the same on both machines? – stark Feb 26 '16 at 23:56
  • @Mysticial Please note, I'm not comparing the performance of AVX against that of AVX2. Nor am I comparing the performance of Ivy Bridge and Haswell. I'm comparing the relative speed-up of a SIMD algorithm vs a non-SIMD algorithm on AVX to the same relative speed-up on AVX2. – eoinmullan Feb 27 '16 at 00:00
  • 2
    You still need to compare on the same machine, since different machines (and different models of processors) have different behaviour in regards to memory bandwidth, cache-sizes, memory speed, cache-speed, etc. If you get better speed on the same machine, with AVX than AVX2, then it's possibly a sign that something isn't quite right with the compilation - but just comparing two different machines with a whole range of different properties will not show that. – Mats Petersson Feb 27 '16 at 00:05
  • @stark I'm afraid the test machines are in my work place so I can't get that info until Monday :(. I've run this on 3 AVX machines, though, and 2 AVX2 machines, and the results were consistent, so I thought there might be some commonly known reason for this behavior. – eoinmullan Feb 27 '16 at 00:07
  • I ran the AVX algorithm on an AVX2 capable machine and it performed pretty much exactly the same as the AVX2 algorithm on that same machine. Note, this was my C++ app and the speed-up relative to the non-SIMD algorithm is around 340%, which is better than the RyuJITted assembly. But this is still less than the AVX algorithm on an AVX only machine, using the C++ app, where the relative speed-up from non-SIMD is about 500+%. – eoinmullan Feb 27 '16 at 00:13
  • 2
    That's exactly what I'd expect to see assuming your benchmark is memory-bound. If your Ivy Bridge machines have more memory bandwidth than the Haswell ones, then it's totally expected to see the scaling be higher on Ivy Bridge than Haswell. If that's the case, then no surprise here. – Mysticial Feb 27 '16 at 00:19
  • Ah, I see, I was focusing only on the number of assembly instructions in the loop. I'll check out that memory bandwidth whenever I can. I'll also try to set up a test that performs no arithmetic, just passes the data through memory, to see if it hits the same limit. Many thanks to all. – eoinmullan Feb 27 '16 at 00:30
  • Both AVX and AVX2 are 256 bits – phuclv Feb 27 '16 at 00:34
  • 1
    @LưuVĩnhPhúc Yeah, but RyuJIT only uses 128 bits on AVX, and _mm256_add_epi16 is an invalid instruction on my AVX processor. It looks from the intel intrinsics guide that only double and float operations are available on 256 bit registers with AVX. – eoinmullan Feb 27 '16 at 10:33
  • @Mysticial Two of the Ivy Bridges I tested are i5-3337U and i7-3770. One of the Haswells that I tested is i5-4670K. From looking at the intel specs it seems that the max memory bandwidth on all these processors is 25.6GB/s. Am I looking at the correct spec? Wouldn't that mean they should all be equally memory bound? – eoinmullan Feb 27 '16 at 17:53
  • The memory bandwidth depends on the actual memory you put in it – harold Feb 27 '16 at 22:21

2 Answers


On an AVX processor, the upper half of the 256-bit registers and floating point units is powered down by the CPU when not executing AVX instructions (VEX-encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units - this takes about 70 microseconds, during which time 256-bit AVX instructions are executed internally as two 128-bit operations.

When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.

Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. And since CPUs can only "turbo boost" when they have thermal headroom, using AVX instructions reduces the opportunity to do so - in fact, the CPU lowers both the base and turbo clock speeds. So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7GHz, then when you start using AVX instructions the base clock drops to 2.1GHz and the boost to only 2.3GHz, and in extreme cases the base clock may be reduced to 1.9GHz (see pages 2-4 of this).

At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20%, SLOWER than when not using AVX instructions. If you're doing long stretches of SIMD operations, the 256-bit-wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, the clock speed penalty will cost more than all the gains AVX alone can deliver.

This can be why 128-bit-wide SIMD runs faster than 256-bit-wide SIMD unless you've got lengthy, intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).

Tim

(From the comments on the question)

If arithmetic operations are not the bottleneck in an algorithm's execution then using SIMD will not provide a speed-up. Other bottlenecks could include memory bandwidth, cache size, memory speed, or cache speed. If a processor with AVX out-performs an AVX2 processor in these areas then the SIMD version will show a larger speed-up over the scalar version on it.

eoinmullan