5

Looking around here and on the internet, I can find a lot of posts about modern compilers beating SSE in many real situations. I have just encountered this in some code I inherited: when I disable some SSE code written in 2006 for integer-based image processing and force the code down the standard C branch, it runs faster.

On modern processors with multiple cores and advanced pipelining, etc, does older SSE code underperform gcc -O2?

Paul R
Ken Y-N
  • “modern compilers beating SSE” Hmm, what? One is an instruction set extension, the other is a program that automatically translates program in one language to programs in another. They do not compete in the same tournament. – Pascal Cuoq Nov 27 '15 at 00:24
  • 4
    @PascalCuoq: It's pretty clear he means compiler output from scalar code beating hand-written SSE code (using asm or intrinsics). Sounds like bad SSE code, IMO. Or auto-vectorizing compilers doing a better job with `-O3 -march=native`. – Peter Cordes Nov 27 '15 at 00:35
  • 2
    @PeterCordes There is no doubt that a compiler generating 256-bit wide rich SSE4 instructions can beat handwritten 128-bit wide lame original SSE instructions. There is no doubt that a competent human can write better code than the compiler does for the same instruction set (ask http://stackoverflow.com/users/142434/stephen-canon his opinion). Is this question the obvious first one, the obvious second one? We have no idea. It refers to “a lot of posts” and it asks us to compare an instruction set and a compiler. – Pascal Cuoq Nov 27 '15 at 07:11
  • @PascalCuoq: I've filed a few gcc bug reports myself when I've seen it generating slow code :P It's usually not hard to improve on compiler output, like I said in my answer. I updated it to address the other half of your answer (about old SSE code not using modern instructions). I mostly overlooked that before, and just talked about tuning for old microarches (unaligned loads). – Peter Cordes Nov 27 '15 at 07:24

5 Answers

11

You have to be careful with microbenchmarks. It's really easy to measure something other than what you thought you were measuring. Microbenchmarks usually don't account for code size at all, in terms of pressure on the L1 I-cache / uop-cache and branch-predictor entries.

They also tend to have all the branches predicted as well as they can be, while a routine that's called frequently but not in a tight loop might not do as well in practice.


There have been many additions to SSE over the years. A reasonable baseline for new code is SSSE3 (found in Intel Core2 and later, and AMD Bulldozer and later), as long as there is a scalar fallback. The addition of a fast byte-shuffle (pshufb) is a game-changer for some things. SSE4.1 adds quite a few nice things for integer code, too. If old code doesn't use it, compiler output, or new hand-written code, could do much better.
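As a rough illustration (a sketch I'm adding, not code from the question; the function names are made up and the SSE4.1 version assumes `-msse4.1` or runtime dispatch): SSE2-only code has to synthesize a signed 32-bit minimum out of a compare plus a blend, while SSE4.1 has it as a single instruction.

#include <smmintrin.h>   // SSE4.1 (also pulls in the SSE2 intrinsics)

// SSE2 only: emulate min of signed 32-bit ints with compare + blend
static inline __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);               // all-ones where a > b
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),         // take b where a > b
                        _mm_andnot_si128(a_gt_b, a));     // take a elsewhere
}

// SSE4.1: a single instruction (pminsd)
static inline __m128i min_epi32_sse41(__m128i a, __m128i b)
{
    return _mm_min_epi32(a, b);
}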

Currently we're up to AVX2, which handles two 128b lanes at once, in 256b registers. There are a few 256b shuffle instructions. AVX/AVX2 gives 3-operand (non-destructive dest, src1, src2) versions of all the previous SSE instructions, which helps improve code density even when the two-lane aspect of using 256b ops is a downside (or when targeting AVX1 without AVX2 for integer code).
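A sketch of what the two-lane behaviour means in practice (illustrative only, not from the question's code): AVX2's 256b byte shuffle works within each 128-bit lane, so a 16-entry LUT has to be broadcast into both lanes before you can use it.

#include <immintrin.h>   // AVX2

// vpshufb shuffles within each 128-bit lane independently, so broadcast the
// 16-byte LUT into both halves. Each index byte is assumed to be in 0..15.
__m256i lut_lookup_256(__m256i indices, __m128i lut16)
{
    __m256i lut2 = _mm256_broadcastsi128_si256(lut16);    // LUT copied into both lanes
    return _mm256_shuffle_epi8(indices, lut2);            // 32 lookups at once
}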

In a year or two, the first AVX512 desktop hardware will probably be around. That adds a huge amount of powerful features (mask registers, and filling in more gaps in the highly non-orthogonal SSE / AVX instruction set), as well as just wider registers and execution units.
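To give an idea of what mask registers buy you (a hedged sketch assuming AVX-512F and `-mavx512f`, so not something you can run on current desktop hardware): a loop's partial final vector can be handled with a mask instead of a separate scalar cleanup loop.

#include <immintrin.h>
#include <stddef.h>

// Add two int32 arrays; the final partial vector is handled with a mask.
void add_i32_avx512(int *dst, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        __mmask16 m = (left >= 16) ? (__mmask16)0xFFFF
                                   : (__mmask16)((1u << left) - 1);
        __m512i va = _mm512_maskz_loadu_epi32(m, a + i);   // masked-off lanes read as 0
        __m512i vb = _mm512_maskz_loadu_epi32(m, b + i);
        _mm512_mask_storeu_epi32(dst + i, m, _mm512_add_epi32(va, vb));
    }
}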


If the old SSE code only gave a marginal speedup over the scalar code back when it was written, or nobody ever benchmarked it, that might be the problem. Compiler advances may lead to the generated code for scalar C beating old SSE that takes a lot of shuffling. Sometimes the cost of shuffling data into vector registers eats up all the speedup of being fast once it's there.

Or depending on your compiler options, the compiler might even be auto-vectorizing. IIRC, gcc -O2 doesn't enable -ftree-vectorize, so you need -O3 for auto-vec.
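For example (a minimal sketch; the function and file name are made up), a loop like this typically only gets auto-vectorized by gcc at `-O3`, or at `-O2 -ftree-vectorize`, and `-march=native` lets it use newer instructions than the SSE2 baseline:

// saxpy.c -- build with e.g.:  gcc -O3 -march=native -c saxpy.c
// restrict tells the compiler the arrays don't overlap, which helps vectorization.
void saxpy(float a, const float *restrict x, float *restrict y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}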


Another thing that might hold back old SSE code is that it might assume unaligned loads/stores are slow, and use `palignr` or similar techniques to go between unaligned data in registers and aligned loads/stores. So old code might be tuned for an old microarchitecture in a way that's actually slower on recent ones.
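For instance (an illustrative sketch of my own, not the inherited code): summing bytes from a possibly misaligned pointer. Old Core2-tuned code would align the pointer and merge with `palignr`; on newer CPUs a plain unaligned load (`movdqu` / `_mm_loadu_si128`) is usually fine as long as it doesn't split a cache line.

#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

// Sum n bytes (n assumed to be a multiple of 16 for simplicity;
// the result fits in 32 bits for moderate n).
uint32_t sum_bytes(const uint8_t *p, size_t n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i)); // unaligned load is OK
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));        // psadbw: sums 8 bytes per half
    }
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));     // combine both 64-bit halves
    return (uint32_t)_mm_cvtsi128_si32(acc);
}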

So even without using any instructions that weren't available previously, tuning for a different microarchitecture matters.


Compiler output is rarely optimal, esp. if you haven't told it about pointers not aliasing (restrict), or being aligned. But it often manages to run pretty fast. You can often improve it a bit (esp. for being more hyperthreading-friendly by having fewer uops/insns to do the same work), but you have to know the microarchitecture you're targeting. E.g. Intel Sandybridge and later can only micro-fuse memory operands with one-register addressing modes. There are other useful links in the x86 tag wiki.
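A quick sketch of the `restrict` / alignment hints mentioned above (gcc/clang-specific builtins; the 16-byte alignment assumption is mine): promising no aliasing and known alignment lets the compiler skip the runtime checks and loop peeling it would otherwise generate.

// Assumes the caller really does pass 16-byte-aligned, non-overlapping arrays.
void add_floats(float *restrict a, const float *restrict b,
                const float *restrict c, int n)
{
    float       *pa = __builtin_assume_aligned(a, 16);
    const float *pb = __builtin_assume_aligned(b, 16);
    const float *pc = __builtin_assume_aligned(c, 16);
    for (int i = 0; i < n; ++i)
        pa[i] = pb[i] + pc[i];
}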


So to answer the title: no, the SSE instruction set is in no way redundant or discouraged. Using it directly, with asm, is discouraged for casual use (use intrinsics instead). Using intrinsics is discouraged unless you can actually get a speedup over compiler output. If they're tied now, it will be easier for a future compiler to do even better with your scalar code than to do better with your vector intrinsics.

Peter Cordes
  • Thanks, I think that answers my question; the Windows code I have floods all 8 cores of an i7 (multithreaded SSE), yet porting it to Linux and disabling the "Use Threads" and "Use SSE" options makes it run faster on a single thread. I remembered a case long ago when a friend had hand-tuned 8086 code that was outperformed by C when compiled for the 80386 due to `repe cmpsb` (?) being less efficient on the '386, so I was curious if SSE had not aged well either. – Ken Y-N Nov 27 '15 at 06:36
  • @KenY-N: The main changes in SSE then vs. SSE now are that unaligned loads are much cheaper (when they don't split cache lines), so doing complicated stuff with `palignr` isn't needed. And [software-prefetch threads](http://www.akkadia.org/drepper/cpumemory.pdf) are not really a thing anymore (modern CPUs are wide enough for hyperthreading to run two normal threads effectively, if they have some branch mispredict or cache miss stalls rather than bottlenecking on some shared execution resource. And HW prefetchers are better and more aggressive). – Peter Cordes Nov 27 '15 at 06:56
  • 2
    Hmm, when you compile on Linux, are you ending up mixing SSE with AVX instructions? (`-march=native`, maybe?) There are big slowdowns on Intel CPUs for that. Also, presumably your threading isn't working well on Linux, because multithreading works the same on Linux as Windows, as long as you use the right APIs. Or maybe it's memory-bound with a single thread, and multithreading just causes more competition for cache. It could be a lot of things. – Peter Cordes Nov 27 '15 at 06:58
  • @PeterCordes, I was not aware that Core 2 and later have at least SSSE3. I thought Core 2 started at SSE2. Thanks for the info! BTW, your writing has improved a lot in my opinion (though I don't think mine has). – Z boson Nov 27 '15 at 10:04
  • @Zboson: k8 only had SSE2, which is why it's the baseline for amd64. Intel didn't do a P6-family CPU with 64bit support until a few years later (Core2). This was unfortunate for Mac, since their first-gen x86 macs had to be 32bit only. If Intel had had 64bit in Core Duo, OS X probably never would have had to have a 32bit x86 kernel. :/ BTW, what do you mean about my writing improving? Content or delivery? I've probably gotten better at organizing thoughts into paragraphs and sections (I'm a big fan of the `---` horizontal-line delimiter :). I've also learned facts to put into answers. :) – Peter Cordes Nov 27 '15 at 10:11
  • @PeterCordes, I mean delivery, your writing is better organized now. You write a bit less colloquial and you make good use of the horizontal-line delimiter. Before sometimes it seemed like a brain dump which maybe made sense to you but took time for me to parse. – Z boson Nov 27 '15 at 10:15
  • @Zboson: yeah, some of my early stuff was very much brain-dump. :P Sometimes I'd figure something out and go looking for a place on SO to brain-dump it. Thanks for the compliment on my writing, BTW. I've found some of your answers interesting, too, since you've approached some things from a different angle than me. (More practical / experimental results, sometimes). – Peter Cordes Nov 27 '15 at 10:24
  • 1
    @Zboson: just thought of this, which you might find interesting: I haven't reformed completely from my brain-dumping ways: I posted http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted/32537772#32537772 in mid september. It went from brain-dump to more ideas -> research -> more brain dump, and I eventually added a bullet point summary at the start which grew to as big as a normal answer. The whole thing would have been longer and more rambling if I hadn't had to edit down to the 30kchar limit :P – Peter Cordes Nov 27 '15 at 11:21
  • Note that on Windows, you can assume SSE/SSE2 as a baseline generally. All x64 native processors must support it, and as of Windows 8.1 the x86 (32-bit) OS won't install without SSE, SSE2, NX, and PAE. Not having 'scalar fallbacks' in the code helps a lot because the extra indirection at runtime can sometimes cost as much as the operation especially if you are doing it on a low-level. See [this blog](http://blogs.msdn.com/b/chuckw/archive/2012/09/11/directxmath-sse-sse2-and-arm-neon.aspx) series for more about various processor instruction sets on Windows. – Chuck Walbourn Nov 27 '15 at 19:30
  • @ChuckWalbourn: Thanks for the heads-up about 32bit windows not working on ancient CPUs. WinXP still runs on AthlonXP, though, which is the most recent CPU to not support SSE2 (last made in ~2003). It's often worth doing CPU dispatching even with an SSE2 baseline, though. SSE4.1 has some great stuff for integer, and AVX/AVX2 are nifty. I like x264's way of doing it: call through function pointers. You set the function pointers once, at startup, with as complex logic as you like (e.g. avoid PSHUFB for some routines on CPUs where PSHUFB is slow (Core2), even though it's available). – Peter Cordes Nov 27 '15 at 21:54
  • Anyway, then each call to a routine with multiple versions takes one branch-target-buffer entry, and 8bytes of D-cache to avoid a mispredict, or worse, a cache miss. (Hopefully routines that are called together have their entries in the table of function pointers nearby, so as few lines as possible of D$ are taken up with hot function pointers.) – Peter Cordes Nov 27 '15 at 21:58
  • 1
    The function dispatch works if the functions are 'heavy-weight' enough to justify the extra indirection. For DirectXMath, I rely on inlining and assume SSE/SSE2 as the baseline so that I can avoid the indirection and let the compiler do the composition. With Xbox One, using ``/arch:AVX`` gives you most of the value of AVX by using the VEX prefix for all SSE/SSE2 instruction codegen. – Chuck Walbourn Nov 28 '15 at 06:44
  • @ChuckWalbourn: yeah, that's true, for small helper functions you might have to bring the indirect-call higher up in the call tree or something. You'd maybe have to generate the larger functions twice, once with SSE2 inlined into them, and once with SSE4 / AVX inlined. Also, if you compile with AVX enabled globally, you can use preprocessor macros to detect that and use any 128bit intrinsic you want (including SSE4 instructions). So if you have any routines which could be much better with SSE4, that's an easy way to use them at compile time when building for a modern target. – Peter Cordes Nov 28 '15 at 07:04
  • AVX's 3-operand mov-elimination benefit is nice, but there are cases where PSHUFB, PMOVZX, or the various integer stuff for different element widths and signedness in SSSE3 and SSE4 can really help. You're not getting any of that just compiling SSE2 intrinsics into VEX-encoded instructions. I'd hardly say that's "most of the benefit" of AVX. Oh, although it probably is a big benefit on Xbox-One's AMD Jaguar cores, now that I looked to see what CPU is in that thing. Intel IvB's mov-elimination means the extra mov insns only eat uop-cache and pipeline slots, not execution ports or latency. – Peter Cordes Nov 28 '15 at 07:06
9

Just to add to Peter's already excellent answer, one fundamental point to consider is that the compiler does not know everything that the programmer knows about the problem domain, and there is in general no easy way for the programmer to express useful constraints and other relevant information that a truly smart compiler might be able to exploit in order to aid vectorization. This can give the programmer a huge advantage in many cases.

For example, for a simple case such as:

// add two arrays of floats

float a[N], b[N], c[N];

for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];

any decent compiler should be able to do a reasonably good job of vectorizing this with SSE/AVX/whatever, and there would be little point in implementing this with SIMD intrinsics. Apart from relatively minor concerns such as data alignment, or the likely range of values for N, the compiler-generated code should be close to optimal.

But if you have something less straightforward, e.g.

// map array of 4 bit values to 8 bit values using a LUT

const uint8_t LUT[16] = { 0, 1, 3, 7, 11, 15, 20, 27, ..., 255 };
uint8_t in[N];   // 4 bit input values
uint8_t out[N];  // 8 bit output values

for (int i = 0; i < N; ++i)
    out[i] = LUT[in[i]];

you won't see any auto-vectorization from your compiler because (a) it doesn't know that you can use PSHUFB to implement a small LUT, and (b) even if it did, it has no way of knowing that the input data is constrained to a 4 bit range. So a programmer could write a simple SSE implementation which would most likely be an order of magnitude faster:

__m128i vLUT = _mm_loadu_si128((const __m128i *)LUT);
for (int i = 0; i < N; i += 16)      // assumes N is a multiple of 16
{
    __m128i vin  = _mm_loadu_si128((const __m128i *)&in[i]);
    __m128i vout = _mm_shuffle_epi8(vin, vLUT);   // 16 LUT lookups in parallel
    _mm_storeu_si128((__m128i *)&out[i], vout);
}

Maybe in another 10 years compilers will be smart enough to do this kind of thing, and programming languages will have methods to express everything the programmer knows about the problem, the data, and other relevant constraints, at which point it will probably be time for people like me to consider a new career. But until then there will continue to be a large problem space where a human can still easily beat a compiler with manual SIMD optimisation.

Paul R
  • 2
    Maybe C should add a basic SIMD datatype and give the programmer explicit control over SIMD rather than with auto-vectorization. – Z boson Nov 27 '15 at 10:06
  • 1
    @Zboson: But would that be an MMX datatype, SSE, AVX256 or AVX512, or an ARM Neon? – MSalters Nov 27 '15 at 12:15
  • @Paul: For your `PSHUFB` example, assumption (b) is just wrong. Compilers today are already smart enough to notice that `in[i]` is used as the index of a `uint8_t[16]`, and _therefore_ restricted to 4 bits. I also fail to see why a peephole optimizer couldn't recognize that LUT and translate it into a `PSHUFB` straight away. Peephole optimizers are decades old. – MSalters Nov 27 '15 at 12:19
  • @MSalters, have you used OpenCL C? It would be something like `floatn` and `intn` where n would go from e.g. 2 to 16. If you used `float8` on an SSE system it would be implemented as two SSE operations. In many cases that would be the same as using only `float4`. – Z boson Nov 27 '15 at 12:19
  • @Zboson: Sound like a case where C++ would make things a lot easier. C++ already has `std::array` – MSalters Nov 27 '15 at 12:24
  • 1
    @MSalters: well if you can show me an actual example of a compiler that would implement the above LUT using `PSHUFB` then I'll be suitably impressed, and will have to work harder to find a better example. – Paul R Nov 27 '15 at 12:30
  • 1
    Given how many people know and really use the `restrict` keyword, I doubt that a compiler would ever have all the necessary information, even if there were ways to express it in C... – stgatilov Nov 27 '15 at 13:21
  • @stgatilov: indeed, and ditto for things like `__builtin_expect`. – Paul R Nov 27 '15 at 13:34
  • 2
    @PaulR: The `__builtin_expect` thing can be used to tell tons of knowledge to the compiler. But unfortunately, the compilers understand only the dumbest hints like "`pMem != nullptr`, so please do not check for null in placement new". – stgatilov Nov 27 '15 at 14:22
  • 1
    @MSalters, I'm well aware of [the best x86 SIMD C++ library](http://www.agner.org/optimize/#vectorclass) (and I don't see how `std::array` can compete). How many people are using this library? Don't you think if this was built into the language more people would be explicitly using SIMD? Not to mention that I find OpenCL's vector syntax for shuffling (e.g. `v.xyzw`) very convenient. – Z boson Nov 27 '15 at 17:04
3

These are really two separate and, strictly speaking, unrelated questions:

1) Have SSE in general, and SSE-tuned codebases in particular, become obsolete / "discouraged" / retired?

Answer in brief: not yet, and not really. High-level reason: there is still enough hardware around (even in the HPC domain, where one can easily find Nehalem) which only has SSE* on board, with no AVX* available. If you look outside HPC, consider for example the Intel Atom CPU, which currently supports only up to SSE4.
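One common way to deal with this mix of hardware (a hedged sketch; the function names are hypothetical, and `__builtin_cpu_init` / `__builtin_cpu_supports` are gcc/clang-specific) is to keep an SSE path and select a wider one at runtime:

#include <stddef.h>
#include <stdint.h>

void process_sse2(uint8_t *dst, const uint8_t *src, size_t n);   // baseline path
void process_avx2(uint8_t *dst, const uint8_t *src, size_t n);   // wide path

static void (*process)(uint8_t *, const uint8_t *, size_t);

void init_dispatch(void)
{
    __builtin_cpu_init();   // populate the CPU feature flags
    process = __builtin_cpu_supports("avx2") ? process_avx2 : process_sse2;
}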

2) Why is gcc -O2 (i.e. auto-vectorized, running on SSE-only hardware) faster than some old (presumably intrinsics-based) SSE implementation written 9 years ago?

Answer: it depends, but first of all, things are improving very actively on the compiler side. AFAIK the top 4 x86 compiler dev teams have made big to enormous investments in auto-vectorization and explicit vectorization over the past 9 years. The reason they did so is also clear: the SIMD "FLOPs" potential of x86 hardware has (formally) been increased "by 8 times" (i.e. 8x of SSE4 peak flops) over those same 9 years.

Let me ask one more question myself:

3) OK, SSE is not obsolete. But will it be obsolete X years from now?

Answer: who knows, but at least in HPC, with wider adoption of AVX2- and AVX-512-compatible hardware, SSE intrinsics codebases are highly likely to be retired soon enough, although it again depends on what you develop. Some low-level optimized HPC / media libraries will likely keep highly tuned SSE code paths for a long time.

zam
  • Intel still sell brand-new mainstream CPUs like Skylake / Kaby Lake without AVX support, under the [Pentium and Celeron labels](https://en.wikipedia.org/wiki/Pentium) (below i3/i5/i7). They also leave out BMI2, because I think they just disable VEX decoding entirely. Presumably this lets them sell chips that have defects in the upper 128 bits of SIMD execution units, but otherwise work fine. This sucks because it means AVX and BMI2 are still *very* far from becoming a universal baseline that most software can assume support for. – Peter Cordes Mar 26 '18 at 21:39
0

You might very well see modern compilers use SSE4. But even if they stick to the same ISA, they're often a lot better at scheduling. Keeping SSE units busy means careful management of data streaming.

Cores are irrelevant as each instruction stream (thread) runs on a single core.

MSalters
  • 2
    Given huge reorder buffers in modern x86 CPUs, is instruction scheduling really important? Also, isn't it true that compiler still performs that 'scheduling' job if you code in SSE intrinsics instead of writing assembly code directly? – stgatilov Nov 27 '15 at 04:59
  • AFAIK, instruction scheduling still is important, if only because compilers know better than most programmers how to exploit those reorder buffers. I don't know how smart compilers are with SSE intrinsics. – MSalters Nov 27 '15 at 12:07
  • Yes, one of the big benefits of using intrinsics over raw asm is that you get all the benefits of peephole optimisation, instruction scheduling, register management, etc - I've seen clang in particular do some amazing things with code generation for SSE intrinsics. – Paul R Nov 27 '15 at 12:39
0

Yes -- but mainly in the same sense that writing inline assembly is discouraged.

SSE instructions (and other vector instructions) have been around long enough that compilers now have a good understanding of how to use them to generate efficient code.

You won't do a better job than the compiler unless you have a good idea of what you're doing. Even then it often won't be worth the effort spent trying to beat the compiler, and even when it is, your efforts at optimizing for one specific CPU might not result in good code for other CPUs.

  • 2
    Do you want to say that auto-vectorization is so good nowadays that writing SSE code manually is not necessary any more? I have heard there are tons of limitations in auto-vectorization... Regarding 'worth the effort', it heavily depends on the particular usage case. – stgatilov Nov 27 '15 at 05:09