
Due to advances in x86 C compilers (namely GCC and Clang), many coding practices once believed to improve efficiency are no longer used, since the compilers can optimize the code better than humans can (e.g. bit shifts vs. multiplication).

Which specific practices are these?

Mys_721tx
  • I disagree with the close decision. There are some practices which have been clearly beneficial, but which are just as clearly counterproductive now. These recommendations are usually based on hard facts, and these hard facts change, changing the recommendations. And the change is a hard fact as well. – cmaster - reinstate monica Jun 05 '14 at 20:32
  • Perhaps "opinion-based" is wrong, but certainly "too broad" isn't. I mean, is everyone supposed to leave one answer or is one person supposed to post every answer? Are we talking x86 or should we consider ARM and PPC processors? Which compiler and at what optimization level? A much better question would be "Optimization *x* has been historically recommended [citation needed]. Does it still apply with a modern compiler like gcc in x86-64 programs?" – indiv Jun 05 '14 at 20:40
  • I'm waiting for someone to say, "All. There is absolutely no reason to optimize at all. The compiler will do everything for you. It will even make you breakfast." – Mysticial Jun 05 '14 at 20:46
  • All. There is absolutely no reason to optimize at all. The compiler will do everything for you. It will even make you breakfast. –  Jun 05 '14 at 20:46
  • I've seen a lot of micro-optimisations (here) based on false assumptions stemming from the eighties. Possibly because today's teachers were raised in the days of the 6800, the 6502 and the 8086, when cycle counting was still effective. – wildplasser Jun 05 '14 at 20:47
  • A concrete old-style optimization is the use of the keyword `register`. I've not used it in aeons (well, this millennium). Compilers can do register allocation better than I can, and generally insist on doing it regardless of what I say. (I've never used `auto` in C code, either, but that's not really an optimization issue.) – Jonathan Leffler Jun 05 '14 at 20:50
  • [Duff's device](http://en.wikipedia.org/wiki/Duff%27s_device), "When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was an improvement in performance." – Elliott Frisch Jun 05 '14 at 20:50
  • And an example of the lengths optimizers will go to (when you'd rather they didn't) is [Why is GCC 4.8.2 complaining about addition under strict overflow?](http://stackoverflow.com/questions/23020208/why-is-gcc-4-8-2-complaining-about-addition-under-strict-overflow) – Jonathan Leffler Jun 05 '14 at 20:53
  • The rule of thumb that remains is IMHO: keep your loops tight. The cause for this used to be the register pressure, nowadays it is mostly the (failing) loop prediction. The thing that changed: keep your data tight. Cache locality dictates performance in most cases nowadays. – wildplasser Jun 05 '14 at 20:53
  • @JonathanLeffler I thought of `register` too. That raised a question. We are not allowed to take the address of a `register`-qualified object. When the compiler puts an object in a register, does it account for whether or not you have taken the address of that object in the code? – P.P Jun 05 '14 at 20:55
  • @BlueMoon: `register` is generally a nop as an optimization, but it's useful in that it makes applying the `&` operator a diagnosed error. This is useful for many other purposes like avoiding introducing code where the compiler may be unable to deduce that the object is not aliased. So I think it's still useful, just for different purposes. – R.. GitHub STOP HELPING ICE Jun 05 '14 at 20:57
  • `register` was just the result of register pressure and register allocation problems. Also: thrashing to and fro the stack will be mostly cached nowadays. – wildplasser Jun 05 '14 at 20:57
  • @BlueMoon: it has to take that into account. Now, in theory, it might mostly keep the value in a register but allocate space in memory for it to be copied to when its address must be passed to a function, but my gut feel is it would seldom be effective to do so. However, if you had triple-nested loops and the address was only passed once per iteration of the outermost loop, but the value was used a lot in the inner loops, some such strategy might be beneficial. – Jonathan Leffler Jun 05 '14 at 20:58
  • @Mys_721tx: since this question keeps getting closed, I think a more pragmatic approach to getting good answers might be opening new questions for *specific optimizations*, of the form "Is [optimization X], which was historically common, still a valid optimization technique?" – R.. GitHub STOP HELPING ICE Jun 05 '14 at 21:32
  • @R..: That's a definite improvement, but without a particular microarchitecture and compiler implementation specified like indiv suggested, it's still untenably broad. – Ben Voigt Jun 05 '14 at 22:38
  • @BenVoigt: I disagree. At the C source level, most of the obsolete optimizations are a matter of evolution of *compiler technology*, not hardware. – R.. GitHub STOP HELPING ICE Jun 06 '14 at 01:16
  • @R..: The examples in the existing answers are about an even split. The double-indirection vs 2-D array thing is definitely dependent on hardware, not compiler. And the use of XOR for swap which duskwuff calls "just a stupid trick" is quite a bit faster on a Microchip PIC which has only a single working register, because `^=` is a single instruction, while `=` requires two plus additional scratch space. – Ben Voigt Jun 06 '14 at 01:29
  • @BenVoigt: I disagree with your interpretation of the xor thing. If it's faster then the compiler should compile a swap to use xors. It should not make you write such idiotic "party tricks" in your source. If the compiler fails to do this, it's a bad compiler. The 2D array example is the only one I've seen so far that actually has anything to do with hardware. – R.. GitHub STOP HELPING ICE Jun 06 '14 at 03:01

4 Answers


Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:

Mathematical transformations

Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.

Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:

x * 2  ->  x + x
x * 2  ->  x << 1

Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2 for negative values of x: the division truncates toward zero, while the arithmetic shift rounds toward negative infinity. It is not appropriate to substitute one for the other!
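
To illustrate (a minimal, self-contained example; the value -3 is arbitrary):

#include <stdio.h>

int main(void)
{
    int x = -3;
    printf("%d\n", x / 2);   /* -1: integer division truncates toward zero */
    printf("%d\n", x >> 1);  /* -2 on typical two's-complement targets:
                                the arithmetic shift rounds toward -infinity */
    return 0;
}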

Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.

Stupid code tricks

I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach (for one thing, the XOR version silently zeroes both values when a and b refer to the same object). Don't use them.
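
For comparison, the obvious approach is a trivial three-assignment swap, which a modern compiler will typically turn into plain register moves or remove entirely:

int tmp = a;  /* plain swap through a temporary */
a = b;
b = tmp;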

The register keyword

This keyword is ignored by many modern compilers, as its intended meaning (requesting that a variable be kept in a register) is not meaningful given current register allocation algorithms.
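
What register still reliably provides, as the comments note, is a diagnostic: taking the address of a register-qualified object is a constraint violation. A minimal sketch (identifier names are illustrative):

register int counter = 0;
int *p = &counter;   /* constraint violation: the address of a
                        register-qualified object may not be taken */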

Code transformations

Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations which are frequently recommended for manual application, but which are rarely useful when applied by hand, include:

  • Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
  • Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)
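
A minimal sketch of the static-function advice (the function name and body are illustrative):

/* With internal linkage the compiler sees every call site, so with
   optimization enabled it will usually inline this without any hint. */
static int clamp_to_byte(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}
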
  • Loop unrolling and force inlining can still give huge speedups when done properly. Everything else you mention is almost universally true nowadays. – Mysticial Jun 05 '14 at 21:00
  • @Mysticial: Loop unrolling is rarely useful unless the loop content is utterly trivial, and in *most* cases, the compiler can unroll it properly. However I've found cases where manual unrolling can still beat anything a compiler can do. I have a crazy `strlen` implementation based on nothing but a particular form of unrolling (and the benefits of asymptotically-100%-correct branch prediction) that beats anything you can do without >=64-bit vectorization on most if not all archs. – R.. GitHub STOP HELPING ICE Jun 05 '14 at 21:18
  • @R.. In most of the cases that I run into, the loop body is a 100+ cycle dependency chain of high-latency floating-point instructions. The compiler refuses to unroll it because it's too big. And the body is too large to fit into the CPU re-order buffer. So there's a lot of ILP to be gained from manually unrolling it and interleaving the iterations. – Mysticial Jun 05 '14 at 21:21
  • @Mysticial: Can you compile such files with `-funroll-all-loops`, or use the equivalent `#pragma` or `__attribute__` on the functions? Obviously it would be nice for the compiler to do the interleaving, and I think it can as long as you avoid aliasing issues (e.g. using the `restrict` keyword). – R.. GitHub STOP HELPING ICE Jun 05 '14 at 21:30
  • @R.. I've never tried it beyond small snippets for experimentation. The main problem that I found is that the compiler does not interleave if the body is too large. The reason is that the scheduling algorithm runs in O(N^2). So when you have (from my experience) more than a few hundred of them, the compiler freaks out and falls back to something else. The interleaving is trivial for me to do because I have big-picture knowledge of the data-dependencies. But to the compiler it's just a DAG with hundreds or thousands of nodes and edges - which takes a tremendous amount of work to process. – Mysticial Jun 05 '14 at 21:36
  • In the end, for such loops I found that manual loop unroll combined with manual vectorization does best and is cross-compiler compatible. (and with an unoptimized version to fall back to if it's ever needed) – Mysticial Jun 05 '14 at 21:36
  • @Mysticial: gcc's [`--param` option](https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Optimize-Options.html#index-param-982) has a number of parameters that may make such scheduling possible. – R.. GitHub STOP HELPING ICE Jun 06 '14 at 01:16
  • On what compiler does `static` help with inlining? Moving the function to a header file and adding the `inline` keyword should be good enough. (Yes, I know `inline` affects linking, not inlining. It's moving the function into a header where it can be seen at point of use that enables inlining, and `inline` cures the resulting link multiple definition errors) – Ben Voigt Jun 06 '14 at 01:35
  • @R.. I admit I've never gone *that* far. I've always resorted to doing the optimizations manually long before I start messing with the compiler at that level. – Mysticial Jun 06 '14 at 01:36

One such practice is to avoid the multiplication in 2D array indexing by using arrays of row pointers instead of real 2D arrays.


Old practice:

#include <stdlib.h>

int width = 1234, height = 5678;
int* buffer = malloc(width*height*sizeof(*buffer));     //one flat allocation
int** image = malloc(height*sizeof(*image));            //table of row pointers
for(int i = height; i--; ) image[i] = &buffer[i*width]; //rows point into the buffer

//Now do some heavy computations with image[y][x],
//which needs two loads but no multiplication.

This used to be faster because multiplications used to be very expensive (on the order of 30 CPU cycles), while memory accesses were virtually free (caches were only added in the 1990s, when memory could no longer keep up with full CPU speed).


But multiplications became fast, with some CPUs able to do them in a single cycle, while memory accesses did not keep pace at all. So, nowadays this code is likely to perform better:

int width = 1234, height = 5678;
int (*image)[width] = malloc(height*sizeof(*image));

//Now do some heavy computations with image[y][x],
//which will invoke pointer arithmetic to calculate the offset as (y*width + x)*sizeof(int).

There are still some CPUs around on which the second version is not faster, but the big multiplication penalty is no longer with us.

cmaster - reinstate monica
  • Technicality: you wouldn't multiply by `sizeof(int)` in the second example; the compiler deals with that scaling for you. – Jonathan Leffler Jun 05 '14 at 21:03
  • @JonathanLeffler If I'm not much mistaken, this multiplication is actually done as part of the load instruction on x86 CPUs. If it were compiled to an instruction of its own, it would usually compile to a bit shift. In either case, somewhere the real, physical offset in bytes has to be calculated, explicitly or implicitly, and that involves multiplying by the size of `int`. – cmaster - reinstate monica Jun 05 '14 at 21:07
  • Minor: Should not the example `image[i] = buffer[i*width]` --> `image[i] = &buffer[i*width]`, missing `&`? – chux - Reinstate Monica Jun 05 '14 at 21:56

Due to the plurality of platforms, you would at best optimize for a given platform (or CPU architecture/model) and compiler. If your code runs on many platforms, micro-optimization is mostly a waste of time. (I'm talking about micro-optimizations here; it's always worth considering better algorithms.)

That said, optimizing for a given platform, such as a DSP, makes sense when the need arises. Then the best first helper is, IMHO, judicious use of the restrict keyword, if the compiler/optimizer supports it well. Avoid algorithms involving conditions and jumpy code (break, goto, if, while, ...); this favors streaming and avoids too many branch mispredictions. I would agree these hints are common sense now.
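
A minimal sketch of the restrict idea (function and parameter names are illustrative, not from the answer; the keyword is a promise that the arrays never overlap):

/* Declaring dst and src restrict tells the compiler they don't alias,
   so it can keep src values in registers and vectorize the loop. */
void scale(float *restrict dst, const float *restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}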

Generally speaking, I would say: any manipulation of the code that relies on assumptions about how the compiler optimizes should, IMHO, be avoided altogether.

Rather, switch to assembly at that point (a common practice for some really important algorithms on DSPs, where the compilers, while being really great, still miss the last few % of CPU/memory-cycle performance...).

jdehaan

One optimization that really shouldn't be used anymore is #define (expanding on duskwuff's answer a bit).

The C preprocessor is a wonderful thing: it can do some amazing code transformations, and it can make certain really complex code much simpler. But using #define just to get a small operation inlined isn't usually appropriate anymore. Most modern compilers have a real inline keyword (or an equivalent, like __inline__), and they're smart enough to inline most static functions anyway, which means that code like this:

#define sum(x, y) ((x) + (y))

is really better written as the equivalent function:

static int sum(int x, int y)
{
    return x + y;
}

You avoid dangerous multiple-evaluation problems with side effects, you get compiler type checking, and you end up with cleaner code, too. If it's worth inlining, the compiler will do it.
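
For illustration, here is the classic multiple-evaluation trap (the MAX macro is the usual textbook example, not code from this answer):

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int x = 5, y = 3;
int m = MAX(x++, y);  /* expands so that x++ is evaluated twice:
                         m ends up as 6, and x as 7 rather than 6 */

A static function version of MAX would evaluate each argument exactly once, so no such surprise is possible.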

In general, save the preprocessor for the circumstances where it's actually needed: emitting a lot of complex, variant code or partial code quickly. Using the preprocessor to inline small functions and define constants is mostly an antipattern now.

Sean Werkema