
Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep a given variable in an internal register. They also made the ternary operator to help generate better code.

As time passed, the compilers matured. Their flow analysis became so good that they can make better decisions about which values to hold in registers than you ever could. The register keyword became unimportant.

FORTRAN can be faster than C for some sorts of operations because of aliasing rules: a FORTRAN compiler may assume that function arguments do not alias each other, while a C compiler cannot. In theory, with careful coding, one can get around this restriction and enable the optimizer to generate faster code.

What coding practices are available that may enable the compiler/optimizer to generate faster code?

  • Identifying the platform and compiler you use would be appreciated.
  • Why does the technique seem to work?
  • Sample code is encouraged.

Here is a related question

[Edit] This question is not about the overall process of profiling and optimizing. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. How can you refactor to remove these prohibitions and allow the optimizer to generate even faster code?

[Edit] Offset related link

– EvilTeach

  • Could be a good candidate for community wiki imho since there's no 'single' definitive answer to this (interesting) question... – ChristopheD Jan 15 '10 at 19:16
  • I miss it every time. Thank you for pointing it out. – EvilTeach Jan 15 '10 at 19:17
  • By 'better' do you mean simply 'faster' or do you have other criteria of excellence in mind ? – High Performance Mark Jan 15 '10 at 19:18
  • Yes. Thank you. I will refine the question. – EvilTeach Jan 15 '10 at 19:19
  • I was thinking c, but there is no real reason to restrict it to that. I will update the tags – EvilTeach Jan 15 '10 at 19:29
  • I wonder about this history of the register keyword. They were writing an OS -- maybe it made sense because of the target of their system? – Hogan Jan 15 '10 at 21:02
  • It's pretty hard to write a good register allocator, especially portably, and register allocation is absolutely essential to performance and code size. `register` actually made performance-sensitive code more portable by combating poor compilers. – Potatoswatter Jan 17 '10 at 01:26
  • @EvilTeach: community wiki doesn't mean "no definitive answer", its not synonymous with the subjective tag. Community wiki means you want to surrender your post to the community so other people can edit it. Don't feel pressured to wiki your questions if you don't feel like it. – Juliet Apr 07 '10 at 16:39
  • I have done a number of these. The question is good. I am expecting to see stuff that I don't know about, that will add another arrow to my quiver. Rep is a side effect. I agree with Chris; there is no best answer. I picked the one that seemed most surprising. Local variables also show up automatically in an msvc window. That is a plus. – EvilTeach Aug 21 '11 at 14:56

32 Answers


Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:

Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.

Next, profile your code.

Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.

Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

– High Performance Mark
  • Compiler developers have finite time, just like everyone else. Not all optimizations will make it into the compiler. Like `&` vs. `%` for powers of two (seldom, if ever, optimized, but can have significant performance impacts). If you read a trick for performance, the only way to know if it works is to make the change and measure the impact. *Never assume the compiler will optimize something for you.* – Dave Jarvis Jan 15 '10 at 20:13
  • & and % is pretty much always optimized, along with most other cheap-as-free arithmetic tricks. What doesn't get optimized is the case of the right-hand operand being a variable that just happens always to be a power of two. – Potatoswatter Jan 15 '10 at 20:17
  • To clarify, I seem to have confused some readers: the advice in the coding practice I propose is to first develop a straightforward code which does not make use of memory-layout instructions to establish a baseline of performance. Then, try things one at a time and measure their impact. I have not offered any advice on the performance of operations. – High Performance Mark Jan 15 '10 at 20:50
  • Yep, it's easy to shoot yourself in the foot with fine-grained memory layout, and never know it because the simple layout is too different. What do you mean by "memory layout instructions", vector pack/unpack? – Potatoswatter Jan 15 '10 at 20:56
  • The last time I laid stuff out in memory to optimize cache performance was for the Intel i860. The only reason I did that was because the OS docs said, "you will do this and you will use this buffer to do it or all our custom optimizations will fail." I hate that architecture to this day. – Mike DeSimone Jan 15 '10 at 21:28
  • For constant power-of-two `n`, gcc replaces `% n` with `& (n-1)` *even when optimisation is disabled*. That's not exactly "seldom, if ever" ... – Porculus Jan 15 '10 at 21:28
  • I'm not sure this is sound advice, at least as currently written. How you lay out your data in memory, at least in the big picture --- such as whether you allocate on the stack or the heap, whether the memory being read inside your inner loop is spatially localized or not, etc --- is very much the programmer's responsibility, and has a *huge* effect on performance. If you do these things wrong no C++ compiler I know of can fix it for you. – user168715 Jan 15 '10 at 21:31
  • % **CANNOT** be optimized as & when the type is signed, due to C's idiotic rules for negative integer division (round towards 0 and have negative remainder, rather than rounding down and always having positive remainder). And most of the time, ignorant coders use signed types... – R.. GitHub STOP HELPING ICE Jul 03 '10 at 21:08
  • @R: Unless of course, the compiler can prove that the variable is positive. Which actually is pretty often. – Ben Voigt Dec 04 '10 at 21:13
  • @Ben - the compiler can't prove that the variable is positive if the function has external linkage. Which is almost all functions. – Tom Jul 03 '11 at 03:15
  • @Tom: Not quite. If the function is defined in the same compilation unit, the compiler is allowed to perform range analysis on it or even inline it, linkage doesn't matter. (If the linker chooses some other definition for which the analysis isn't valid, that's an ODR violation and behavior is undefined anyway). – Ben Voigt Jul 03 '11 at 03:58
  • @R.. `ignorant coders use signed types`: that is not true (anymore). It is recommended to use signed types everywhere you don't do bit operations. There are multiple talks at CppCon etc. and the CppCoreGuidelines (index type being signed). The reason is that the UB on over/underflow allows plenty of optimization and reduces bugs. – Flamefire Feb 22 '18 at 15:52
  • I'd love to hear about an example where expected performance improvement from (supposedly) better memory layout didn't materialize – Shihab Shahriar Khan Mar 28 '23 at 03:28

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i=0; i<numFoo; i++)
    {
         barOut.munge(foo1, foo2[i]);
    }
}

the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i=0; i<numFoo; i++)
    {
         barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

– celion
  • This has the added benefit of often making things easier to read/understand for programmers as well, since they don't have to worry about possible non-obvious side-effects either. – Michael Burr Jan 15 '10 at 23:18
  • Most IDEs display local variables by default, so there is less typing – EvilTeach Mar 16 '10 at 21:12
  • you can also enable that optimization by using restricted pointers – Ben Voigt Mar 24 '10 at 23:27
  • @Ben - that's true, but I think this way is clearer. Also, if the input and output did overlap, I believe the result is unspecified with restricted pointers (probably get different behavior between debug and release), whereas this way will at least be consistent. Don't get me wrong, I like using restrict, but I like not needing it even more. – celion Mar 25 '10 at 07:25
  • You've just got to hope that Foo doesn't have a copy operation defined that copies a couple of meg of data ;-) – Skizz Apr 07 '10 at 16:39
  • @Skizz: True, the times that I've used this trick, "Foo" in question has been a plain-old-data struct, and "munge" was a fast, inlined method. In those cases, the few extra instructions spent copying were well worth avoiding the extra loads/stores that I avoided. – celion Apr 07 '10 at 20:38
  • The original question was about C, not C++, so both constructors and references are off-topic... – R.. GitHub STOP HELPING ICE Jul 03 '10 at 21:04
  • This is a pessimization if `munge` is not inlinable; see http://stackoverflow.com/questions/7761810/optimization-trick-does-it-really-work/7762051#7762051 – bdonlan Oct 14 '11 at 01:00
  • This code likely does not optimize well, if munge() is not an inline function. Using a local variable seems to be a good idea for good coding practice, though. There is an example that works at http://stackoverflow.com/questions/7761810/optimization-trick-does-it-really-work/7762094#7762094 – casualcoder Oct 14 '11 at 02:52
  • People seem to be getting hung up on `Foo` and `munge`. Let's consider this instead: void sum(const float* vals, int numVals, float& sumOut){ sumOut = 0.0f; for (int i=0; i – celion Oct 15 '11 at 17:48

The order in which you traverse memory can have profound impacts on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscious of cache-locality concerns when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column-major format will tend to give you more cache misses and make your program more memory-bound than processor-bound:

#define N 1000
int matrix[N][N] = { ... };

//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[i][j];
  }
}

//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
  for(int j = 0; j < N; j++){
    sum += matrix[j][i];
  }
}
– vicatcu
  • Strictly speaking this is not an optimizer issue, but is an optimization issue. – EvilTeach Jan 15 '10 at 20:36
  • Sure it's an optimizer issue. People have been writing papers about automatic loop interchange optimization for decades. – Phil Miller Jan 15 '10 at 21:41
  • @Potatoswatter What are you talking about? The C compiler can do whatever it wants to as long as the same final result is observed, and indeed GCC 4.4 has `-floop-interchange` which will flip an inner and outer loop if the optimizer deems it profitable. – ephemient Jan 16 '10 at 07:27
  • Huh, well there you go. C semantics are often marred by aliasing issues. I guess the real advice here is to pass that flag! – Potatoswatter Jan 16 '10 at 16:49

Generic Optimizations

Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using them.

Declare small functions as inline or macros

Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. A small function may have fewer statements in its body than the combined call-and-return overhead; such functions are good candidates for inlining, whether as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it a reminder to the compiler.)

Remove dead and redundant code

If the code isn't used or does not contribute to the program's result, get rid of it.

Simplify design of algorithms

I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplifying the algebraic expression. The implementation of the simplified expression took up less room and time than the original function.

Loop Unrolling

Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.

Edit: added an example of loop unrolling. Before:

unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

After unrolling:

unsigned int sum = 0;
const size_t STATEMENTS_PER_LOOP = 8;
size_t i = 0;
for (; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
    sum += *buffer++; // 1
    sum += *buffer++; // 2
    sum += *buffer++; // 3
    sum += *buffer++; // 4
    sum += *buffer++; // 5
    sum += *buffer++; // 6
    sum += *buffer++; // 7
    sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}

A secondary benefit is gained as well: more statements are executed before the processor has to reload the instruction cache.

I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.

Reduction of if statements

Processors hate branches, or jumps, since they force the processor to reload its queue of instructions.

Boolean Arithmetic (Edit: example added)

Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:

bool status = true;
status = status && /* first test */;
status = status && /* second test */;

The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.

Example:

struct Reader_Interface
{
  virtual bool  write(unsigned int value) = 0;
};

struct Rectangle
{
  unsigned int origin_x;
  unsigned int origin_y;
  unsigned int height;
  unsigned int width;

  bool  write(Reader_Interface * p_reader)
  {
    bool status = false;
    if (p_reader)
    {
       status = p_reader->write(origin_x);
       status = status && p_reader->write(origin_y);
       status = status && p_reader->write(height);
       status = status && p_reader->write(width);
    }
    return status;
  }
};

Factor Variable Allocation outside of loops

If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.

Factor constant expressions outside of loops

If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.

I/O in blocks

Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:

static const char  Menu_Text[] = "\n"
    "1) Print\n"
    "2) Insert new customer\n"
    "3) Destroy\n"
    "4) Launch Nasal Demons\n"
    "Enter selection:  ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);

The efficiency of this technique can be visually demonstrated. :-)

Don't use printf family for constant data

Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.

Format to memory, then write

Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
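A minimal sketch of that pattern (the record fields are invented):

#include <cstdio>

// Build the whole record in memory with sprintf, then emit it with a single
// fwrite, instead of issuing one formatted-output call per field.
void write_record(std::FILE * out, const char * name, int id)
{
    char buffer[256];
    int len = 0;
    len += std::sprintf(buffer + len, "id=%d\n", id);      // variable section
    len += std::sprintf(buffer + len, "name=%s\n", name);  // variable section
    std::fwrite(buffer, 1, (std::size_t)len, out);         // one block write
}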

Declare constant text (string literals) as static const

When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.

Lastly, Code like the compiler would

Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

– Thomas Matthews
  • Interesting can you provide an example where you got better code with a few small statements, instead of a larger one. Can you show an example of rewriting a if, using booleans. Generally, I would leave the loop unrolling to the compiler, as it probably has a better feel for cache size. I'm a bit surprised about the idea of sprintfing, then fwriting. I would think that fprintf actually does that under the hood. Can you give a bit more detail here? – EvilTeach Jan 16 '10 at 20:49
  • There is no guarantee that `fprintf` formats to a separate buffer then outputs the buffer. A streamlined (for use of memory) `fprintf` would output all unformatted text, then format and output, and repeat until the entire format string is processed, thus making 1 output call for each type of output (formatted vs. unformatted). Other implementations would need to dynamically allocate memory for each call to hold the entire new string (which is bad in embedded systems environment). My suggestion reduces the number of outputs. – Thomas Matthews Jan 17 '10 at 19:42
  • I once got a significant performance improvement by rolling up a loop. Then I figured out how to roll it up tighter by using some indirection, and the program got noticeably faster. (Profiling showed this particular function to be 60-80% of the runtime, and I tested performance carefully before and after.) I believe the improvement was due to better locality, but I'm not completely sure about that. – David Thornley Apr 02 '10 at 15:11
  • Many of these are programmer optimizations rather than ways for programmers to help the compiler to optimization, which was the thrust of the original question. For example, loop unrolling. Yes, you can do your own unrolling, but I think it's more interesting to figure out what roadblocks there are to the compiler unrolling for you and removing those. – Adrian McCarthy May 26 '11 at 15:31
  • In many cases, you can tell the compiler exactly what you want it to do rather than trying to give it enough hints. gcc for example has several builtins that can often be replaced with specialized instructions by the compiler. The downside is that you're now locked into gcc and you may have to compile for different specific CPU architectures to get the benefits of the builtin functions. You can also use `__attribute__` to tell the compiler how to treat some parts of your code. This is also a gcc feature so you need to decide if you want to lock yourself in to one compiler. – Kaiser Keister Feb 15 '23 at 16:23

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.

That said, you shouldn't call, in an inner loop, a small function defined in another file, as that stops it from being inlined.

Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.

Which leads to the next point, read the ^#$@ manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.
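For instance, a sketch of those annotations in GCC/Clang syntax (the function is made up; verify the speedup with the vectorizer report, e.g. -fopt-info-vec on recent gcc):

// __restrict__ promises that a and b never overlap;
// __builtin_assume_aligned promises 16-byte alignment. Together they are
// often all the optimizer needs to emit SIMD code for the loop at -O2/-O3.
void add_arrays(float * __restrict__ a, const float * __restrict__ b, int n)
{
    float * pa = (float *)__builtin_assume_aligned(a, 16);
    const float * pb = (const float *)__builtin_assume_aligned(b, 16);
    for (int i = 0; i < n; ++i)
        pa[i] += pb[i];
}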

– Potatoswatter
  • This is a good answer, but note that whole-program optimization is becoming more popular, and can in fact inline functions across translation units. – Phil Miller Jan 15 '10 at 21:40
  • @Novelocrat Yep - needless to say I was very surprised the first time I saw something from `A.c` get inlined into `B.c`. – Jonathon Reinhart Jan 21 '13 at 07:03

On most modern processors, the biggest bottleneck is memory.

Aliasing: Load-hit-store can be devastating in a tight loop. If you're reading one memory location and writing to another, and you know that they are disjoint, carefully putting an alias keyword (such as restrict) on the function parameters can really help the compiler generate faster code. However, if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behavior!

Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.

Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.
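For the prefetch intrinsics, a hedged sketch using GCC/Clang's __builtin_prefetch (other toolchains have equivalents such as _mm_prefetch); the 16-element lookahead is an invented tuning value you would have to measure:

long sum_array(const int * data, int n)
{
    long total = 0;
    for (int i = 0; i < n; ++i)
    {
        // Hint that data[i + 16] will be wanted soon: read-only (0), low
        // temporal locality (1). Prefetching past the end is only a hint,
        // so it is harmless.
        __builtin_prefetch(&data[i + 16], 0, 1);
        total += data[i];
    }
    return total;
}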

– Francis Boivin
  • +1 for load-hit-stores and different register types. I'm not sure how big a deal it is on x86, but they're devastating on PowerPC (e.g. Xbox 360 and PlayStation 3). – celion Jan 15 '10 at 21:59
  • Most papers on compiler loop optimization techniques assume perfect nesting, which means that the body of each loop except the innermost is just another loop. These papers simply do not discuss the steps necessary to generalize such, even if it's very clear that they can be. Thus, I'd expect a lot of implementations to not actually support those generalizations, because of the extra effort entailed. Thus, the many algorithms for optimizing cache usage in loops might work a lot better on perfect nests than on imperfect nests. – Phil Miller Jan 16 '10 at 00:42
  • You can avoid some cache misses by grouping related data in your code. For example if you order struct members such that padding between them is minimized, and you group members that will be accessed at the same time, then you can avoid some cache misses that might otherwise happen. – Kaiser Keister Feb 15 '23 at 16:27

Use const correctness as much as possible in your code. It allows the compiler to optimize much better.

In this document are loads of other optimization tips: CPP optimizations (a bit old document though)

highlights (two of these are sketched below):

  • use constructor initialization lists
  • use prefix operators
  • use explicit constructors
  • inline functions
  • avoid temporary objects
  • be aware of the cost of virtual functions
  • return objects via reference parameters
  • consider per class allocation
  • consider stl container allocators
  • the 'empty member' optimization
  • etc
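A minimal sketch of two of those bullets (the class and function are invented for illustration):

#include <string>

class Customer
{
public:
    // Constructor initialization list: name_ is copy-constructed once,
    // instead of being default-constructed and then assigned.
    explicit Customer(const std::string & name) : name_(name) {}
private:
    std::string name_;
};

std::string::size_type count_chars(const std::string & s)
{
    std::string::size_type n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
        ++n;  // prefix ++ avoids the temporary that postfix ++ must return
    return n;
}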
– Toad
  • Not much, rarely. It does improve actual correctness, though. – Potatoswatter Jan 15 '10 at 19:37
  • In C and C++ the compiler can't use const to optimize because casting it away is well-defined behavior. – dsimcha Jan 15 '10 at 19:57
  • +1 : const is a good example of something that will directly impact the compiled code. re @dsimcha's comment -- a good compiler will test to see if this happens. Of course, a good compiler will "find" const elements that are not declared that way anyway... – Hogan Jan 15 '10 at 20:54
  • @dsimcha: Changing a `const` *and* `restrict` qualified pointer, however, is undefined. So a compiler could optimize differently in such a case. – Dietrich Epp Jun 26 '11 at 02:33
  • @dsimcha casting away `const` on a `const` reference or `const` pointer to a non-`const` object is well-defined. modifying an actual `const` object (i.e. one declared as `const` originally) is not. – Stephen Lin Mar 15 '13 at 03:03

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.

However, I would remind people that for the code to be optimised you have to tell the compiler to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.

  • I confess to being peculiar -- I work on large scientific number-crunching codes which are memory-bandwidth bound. For the general population of programs I agree with Neil. – High Performance Mark Jan 15 '10 at 20:52
  • True; but an awful lot of that I/O-bound code nowadays is written in languages that are practically *pessimizers* -- languages that don't even have compilers. I suspect that the areas where C and C++ are still used will tend to be areas where it's more important to optimise something (CPU usage, memory usage, code size ...) – Porculus Jan 15 '10 at 21:39
  • I would say 99% is I/O bound but I've hit a few cases that weren't. One was a test to ensure there were no dead-ends in a combination of options--O(n^8), albeit heavily pruned and one to price a set of upgrades against a set of plans with variations. – Loren Pechtel Jan 16 '10 at 19:40
  • The code I work on (desktop apps with big UI and number-crunching) show me there's a nether-region between I/O-bound and CPU-bound. It's needless-either-one-bound, and it is fixed by finding activities being called for mid-stack that invoke an amazing amount of hoo-haw that may or may not (it doesn't matter) involve I/O. – Mike Dunlavey Mar 21 '10 at 20:50
  • I've spent most of the last 30 years working on code with very little I/O. Save for 2 years doing databases. Graphics, control systems, simulation - none of it I/O bound. If I/O was most peoples bottleneck, we wouldn't pay Intel and AMD much attention. – phkahler Apr 07 '10 at 16:53
  • Yeah I don't really buy this argument- otherwise we (at my job) wouldn't be looking for ways to spend more of the compute time also doing I/O. Also- much of the I/O bound software that I've come across has been I/O bound because the I/O was done sloppily; if one optimizes the access patterns (just like with memory), one can get huge gains in performance. – dash-tom-bang Apr 07 '10 at 18:41
  • @dash "otherwise we wouldn't be looking for ways to spend more of the compute time also doing I/O" - that is exactly what you need to do if your process is I/O bound - you need to find a way to get it to do more I/O in the same time slot. –  Apr 07 '10 at 18:49
  • phkahler: Intel made my motherboard (including its integrated USB 2.0 and gigabit ethernet controllers), and are known for making the fastest SSDs. I think I/O is my bottleneck, and that's *why* I pay Intel so much attention (and money!). :-) – Ken Jun 20 '10 at 00:47
  • I've recently discovered that almost no code written in the C++ language is I/O bound. Sure, if you're calling an OS function for bulk disk transfer, your thread may go into I/O wait (but with caching, even that's questionable). But the usual I/O library functions, the ones everybody recommends because they're standard and portable, are actually miserably slow compared to modern disk technology (even the moderately priced stuff). Most likely, I/O is the bottleneck only if you're flushing all the way to disk after writing just a few bytes. OTOH, UI is a different matter, we humans are slow. – Ben Voigt Dec 04 '10 at 21:24

Attempt to program using static single assignment as much as possible. SSA is exactly what you end up with in most functional programming languages, and it's what most compilers convert your code to in order to do their optimizations, because it's easy to work with. Doing this brings to light places where the compiler might get confused. It also makes all but the worst register allocators work as well as the best ones, and it lets you debug more easily because you almost never have to wonder where a variable got its value from, as there was only one place it was assigned.
Avoid global variables.

When working with data by reference or pointer, pull it into local variables, do your work, and then copy it back. (Unless you have a good reason not to.)

Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:

x= f();
if(!x){
   a();
} else if (x<0){
   b();
} else {
   c();
}

is almost always cheaper than testing for other constants.

Another trick is to use subtraction to eliminate one compare in range testing.

#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
    unsigned int bar = foo-FOO_MIN;
    int rc = (bar <= (FOO_MAX - FOO_MIN)) ? 1 : 0;
    return rc;
} 

This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping up with the result of the first comparison while doing the second and then combining them. This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.

When using the string functions in C (strcpy, memcpy, ...), remember what they return: the destination! You can often get better code by 'forgetting' your copy of the pointer to the destination and just grabbing it back from the return of these functions.
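A tiny sketch of that idea (the function is hypothetical):

#include <string.h>

/* strcpy and strcat both return their destination argument, so the chained
   form never needs to keep a second copy of dst live across the calls. */
char * make_greeting(char * dst, const char * name)
{
    return strcat(strcpy(dst, "hello, "), name);
}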

Never overlook the opportunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:

foo_t * make_foo(int a, int b, int c) {
        foo_t * x = malloc(sizeof(foo_t));
        if (!x) {
             // return NULL;
             return x; // x is NULL, already in the register used for returns, so duh
        }
        x->a = a;
        x->b = b;
        x->c = c;
        return x;
}

Of course, you could reverse the logic on that if and only have one return point.

(tricks I recalled later)

Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function, then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into the registers or stack positions that called functions usually expect their parameters to be in (to deviate, it has to change both the called function and the call sites of all callers). The compiler can also often take advantage of knowing what memory and registers the called function will need, and avoid generating code to preserve variable values that are in registers or memory locations the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benefit of inlining code, but without actually inlining.

– nategoose

I wrote an optimizing C compiler and here are some very useful things to consider:

  1. Make most functions static. This allows interprocedural constant propagation and alias analysis to do their job; otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the parameters. If you look at the well-known open-source libraries, they all mark functions static except the ones that really need to be extern.

  2. If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants. (A sketch combining this point and the previous one follows the list.)

  3. NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.

  4. Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.

  5. Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.

  6. Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem to the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.

  7. When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).

  8. Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.
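A minimal sketch combining points 1 and 2 (names invented):

// Point 1: static gives internal linkage, so the compiler sees every caller
// and can inline, propagate constants, and skip the standard calling
// convention. Point 2: a static const table lets loads fold to constants.
static const int SCALE_TABLE[4] = { 1, 2, 4, 8 };

static int scale_value(int v, unsigned int sel)
{
    return v * SCALE_TABLE[sel & 3u];
}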

– Gratian Lup

In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.

Algorithms used to manage the heap are notoriously slow on some platforms (e.g., VxWorks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal while the heap is still clean, but after the device runs for a while the heap can become fragmented. The calls are going to take longer, and you cannot easily calculate how performance will degrade over time. You cannot really produce a worst-case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
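For illustration only, a sketch of the fixed-block idea; every name and size here is invented, and a production pool (like glib's slices) also worries about alignment and thread safety:

#include <cstddef>

// Fixed-block pool: every block has the same size, so acquire/release are
// O(1) free-list operations with deterministic timing, and the pool can
// never fragment. Typically a static/global object, given its size.
class FixedPool
{
public:
    FixedPool() : free_list_(0)
    {
        for (std::size_t i = 0; i < BLOCK_COUNT; ++i)
            release(storage_ + i * BLOCK_SIZE);
    }
    void * acquire()                     // pop a block off the free list
    {
        void * p = free_list_;
        if (p) free_list_ = *static_cast<void **>(p);
        return p;                        // 0 when the pool is exhausted
    }
    void release(void * p)               // push the block back on the list
    {
        *static_cast<void **>(p) = free_list_;
        free_list_ = p;
    }
private:
    static const std::size_t BLOCK_SIZE  = 64;   // must be >= sizeof(void *)
    static const std::size_t BLOCK_COUNT = 1024;
    void * free_list_;
    char   storage_[BLOCK_SIZE * BLOCK_COUNT];
};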

– figurassa
  • My rule of thumb is if you have to dynamically allocate, get an array so you don't need to do it again. Preallocate them vectors. – EvilTeach Nov 21 '18 at 01:49

A dumb little tip, but one that will save you some microscopic amounts of speed and code.

Always pass function arguments in the same order.

If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).

The reason for this is that the platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers, the compiler does not have to move them around.

While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

– Zan Lynx
  • Neither C nor C++ make any guarantees about, or even mention, passing in particular registers or stack locations. It's the *ABI* (e.g. Linux ELF) that determines the details of parameter passing. – Emmet Jul 21 '14 at 15:30

Two coding techniques I didn't see in the above list:

Bypass the linker by writing the code as a single source

While separate compilation is really nice for compile time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond the compilation unit; that is the linker's reserved domain.

But if you design your program well, you can also compile it through a single common source. That is, instead of compiling unit1.c and unit2.c and then linking both objects, compile all.c, which merely #includes unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.

It's very like writing header-only programs in C++ (and even easier to do in C).

This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware that it changes part of the C semantics, and you can meet some problems like static variables or macro collisions. For most programs it's easy enough to overcome the small problems that occur. Also be aware that compiling as a single source is way slower and may take a huge amount of memory (usually not a problem with modern systems).
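For instance, a sketch of such a 'single source' file (file names hypothetical):

// all.c - the only file handed to the compiler. Every function body is
// visible in one translation unit, so the optimizer can inline and
// propagate constants across what would otherwise be separate objects.
#include "unit1.c"
#include "unit2.c"
#include "unit3.c"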

Using this simple technique I happened to make some programs I wrote ten times faster!

Like the register keyword, this trick could soon become obsolete: optimizing through the linker is beginning to be supported by compilers; see gcc's link-time optimization.

Separate atomic tasks in loops

This one is more tricky. It's about the interaction between algorithm design and the way the optimizer manages the cache and register allocation. Quite often programs have to loop over some data structure and perform some action for each item. Quite often the actions performed can be split into two logically independent tasks. If that is the case, you can write exactly the same program with two loops over the same boundary, each performing exactly one task. In some cases writing it this way can be faster than the single loop (the details are more complex, but an explanation is that in the simple-task case all variables can be kept in processor registers, while in the more complex case that isn't possible, so some registers must be written to memory and read back later, at a cost higher than the additional flow control).

Be careful with this one (profile performance with and without the trick): as with using register, it may just as well give worse performance instead of better.
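A sketch of the two-loop rewrite (the tasks are hypothetical); only a profile can tell you which version wins on your data:

// Fused: one loop does two independent jobs and may run out of registers.
void update_fused(float * pos, float * temp, const float * in, int n)
{
    for (int i = 0; i < n; ++i)
    {
        pos[i]  += in[i];    // task 1
        temp[i] *= 0.99f;    // task 2, independent of task 1
    }
}

// Split: each simple loop may keep its whole working set in registers.
void update_split(float * pos, float * temp, const float * in, int n)
{
    for (int i = 0; i < n; ++i)
        pos[i] += in[i];
    for (int i = 0; i < n; ++i)
        temp[i] *= 0.99f;
}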

– kriss
  • Yes, by now, LTO has made the first half of this post redundant and probably bad advice. – underscore_d Mar 13 '16 at 21:35
  • @underscore_d: there are still some issues (mostly related to visibility of exported symbols), but from a mere performance point of view there is probably no need any more. – kriss Mar 14 '16 at 09:12

Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.

Example:

int fac2(int x, int cur) {
  if (x == 1) return cur;
  return fac2(x - 1, cur * x); 
}
int fac(int x) {
  return fac2(x, 1);
}

Of course this example doesn't have any bounds checking.

Late Edit

While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

– Hogan
  • At least when it comes to self-calls. Generalised tail-call optimisation is hard in, for example, C, due to the calling conventions most platforms use. – Vatine Jan 15 '10 at 19:21
  • the question is about C. C does not remove tail-recursion, so tail or other recursion, the stack might blow if the recursion goes too deep. – Toad Jan 15 '10 at 19:22
  • I have avoided the calling convention issue, by using a goto. There is less overhead that way. – EvilTeach Jan 15 '10 at 19:22
  • @reinier: what do you mean... a good c compiler could optimize tail end recursion. – Hogan Jan 15 '10 at 19:26
  • @vatine: true I was thinking self-calls. – Hogan Jan 15 '10 at 19:26
  • @hogan: this is new to me. Could you point to any compiler which does this? And how can you be sure it actually optimizes it? If it would do this one really needs to be sure it does it. It's not something you hope the compiler optimizer picks up on (like inlining which may or may not work) – Toad Jan 15 '10 at 19:28
  • @hogan: I stand corrected. You are right that Gcc and MSVC both do tail recursion optimization. – Toad Jan 15 '10 at 19:33
  • This example isn't tail recursion as its not the recursive call that is last, its the multiplication. – Brian Young Jan 15 '10 at 19:36
  • @reinier: For grad school I wrote a compiler that did it in a C like language (about 70% of C) it is really easy. – Hogan Jan 15 '10 at 19:49
  • @Roger+@Brian: My compiler was able to optimize the code as I coded it. This was back in 95, I'm guessing there is a modern language that does tail-end optimization requiring code like @Roger wrote so this is why you are all making crazy claims that the original code could not be optimized. – Hogan Jan 15 '10 at 20:32
  • @Hogan: comments were just saying it wasn't tail recursion (which it wasn't), not that it couldn't be optimized. Since that detracted from your point (even if not implemented by all compilers), I changed it; that's all. –  Jan 17 '10 at 00:14

I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

– dsimcha
  • Putting functions that are used together in close physical proximity in source increases the likelihood that they will be near each other in object files and near each other in your executable. This improved locality of instructions can help avoid instruction cache misses while running. – oz10 Jan 15 '10 at 20:11
  • The AIX compiler has a compiler switch to encourage that behavior -qipa[=] | -qnoipa Turns on or customizes a class of optimizations known as interprocedural analysis (IPA). – EvilTeach Jan 15 '10 at 20:12
  • Best is to have a way to develop that does not require this. Using this fact as an excuse to write un-modular code will overall just result in code that is slow and has maintenance problems. – Hogan Jan 15 '10 at 21:00
  • I think this information is slightly dated. In theory, the whole-program-optimization features built into many compilers now (e.g. "Link-time Optimization" in gcc) allow for the same benefits, but with a totally standard workflow (plus faster recompilation times than putting it all in one file!) – Ponkadoodle Dec 28 '14 at 11:04
  • @Wallacoloo For sure, this is faaar outta date. FWIW, I just used GCC's LTO for the first time today, and - all else being equal at `-O3` - it blasted 22% of the original size off my program. (It's not CPU-bound, so I haven't got much to say about speed.) – underscore_d Mar 13 '16 at 21:32

Don't do the same work over and over again!

A common antipattern that I see goes along these lines:

void Function()
{
   MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
   MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
   MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
   MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
   MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}

The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, know that the aggregated object isn't changing over the course of these calls, then for the love of all that is holy...

void Function()
{
   MySingleton* s = MySingleton::GetInstance();
   AggregatedObject* ao = s->GetAggregatedObject();
   ao->DoSomething();
   ao->DoSomethingElse();
   ao->DoSomethingCool();
   ao->DoSomethingReallyNeat();
   ao->DoSomethingYetAgain();
}

In the case of the singleton getter the calls may not be too costly, but there is certainly a cost (typically, "check to see if the object has been created; if it hasn't, create it, then return it"). The more complicated this chain of getters becomes, the more wasted time we'll have.

– dash-tom-bang
  1. Use the most local scope possible for all variable declarations.

  2. Use const whenever possible

  3. Dont use register unless you plan to profile both with and without it

The first two of these, especially #1, help the optimizer analyze the code. In particular, they help it make good choices about which variables to keep in registers.

Blindly using the register keyword is as likely to hurt as to help your optimization. It's just too hard to know what will matter until you look at the assembly output or profile.

There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.

– John Knoeller

Align your data to native/natural boundaries.
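For instance, a sketch (the struct is invented; 64 bytes is a typical but not universal cache-line size): order members widest-first so the compiler needn't insert padding, and align hot structures explicitly.

#include <stdint.h>

// GCC/Clang syntax shown; MSVC spells it __declspec(align(64)), and C++11
// has alignas. Members are ordered widest-first to avoid padding holes.
struct Particle
{
    double   position[3];   // 8-byte members first
    double   velocity[3];
    uint32_t id;            // then 4-byte
    uint16_t flags;         // then 2-byte
    uint8_t  state;         // then 1-byte
} __attribute__((aligned(64)));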

– EvilTeach

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).

The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.

Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.

The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.

– dash-tom-bang

A neat technique I learned from @MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:

// before
BigObject f(bool condition)
{
  BigObject a, b;
  // ... fill in a and b ...
  if(condition)
    return a;
  else
    return b;
}

// after: both paths return a, so the compiler can apply NRVO
BigObject f(bool condition)
{
  BigObject a, b;
  // ... fill in a and b ...
  if(condition)
    swap(a,b);
  return a;
}
– Xeo

If you've got small functions you call repeatedly, I have in the past gotten large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.

Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
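A sketch of the recursion-to-stack rewrite (the Node type is invented):

#include <vector>

struct Node { int value; Node * left; Node * right; };

// Iterative traversal with an explicit stack: no call/return overhead and
// no risk of blowing the machine stack on deep trees, at some cost in
// clarity.
long sum_tree(Node * root)
{
    long total = 0;
    std::vector<Node *> stack;
    if (root) stack.push_back(root);
    while (!stack.empty())
    {
        Node * n = stack.back();
        stack.pop_back();
        total += n->value;
        if (n->left)  stack.push_back(n->left);
        if (n->right) stack.push_back(n->right);
    }
    return total;
}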

– Remy
  • Converting recursion to a stack is an *assumed* optimization on ompf.org, for people developing raytracers and writing other rendering algorithms. – Tom Apr 02 '10 at 16:22
  • ...I should add to this, that the biggest overhead in my personal raytracer project is vtable-based recursion through a bounding-volume hierarchy using the Composite pattern. It's really just a bunch of nested boxes structured as a tree, but using the pattern causes data bloat (virtual table pointers) and reduces instruction coherency (what could be a small/tight loop is now a chain of function calls) – Tom Apr 02 '10 at 16:26

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.

Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.

I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.

Yes, compiler writers are not from a race of coding giants, and compilers contain mistakes; what should, according to the manual and to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.

And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.

Regards

Mark

When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

– High Performance Mark

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

– EvilTeach
  • x86-64 also allows a lot of register-passed parameters, which can have a dramatic effect on function call overhead. – Tom Apr 02 '10 at 16:31

For performance, focus first on writing maintainable code: componentized, loosely coupled, etc., so that when you have to isolate a part to rewrite, optimize, or simply profile, you can do it without much effort.

The optimizer will only help your program's performance marginally.

– Ariel
  • That only works if the coupling "interfaces" themselves are amenable to optimization. An interface can be inherently "slow", e.g. by forcing redundant lookups or calculations, or forcing bad cache access. – Tom Apr 02 '10 at 16:28

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say

Assume that the program has been written correctly, compiled with full optimization, tested and put into production.

In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.

If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be had in every program as first written, but in many (except for very small programs), they can, in my experience.

After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

– Mike Dunlavey

I use the Intel compiler, on both Windows and Linux.

When more or less done, I profile the code. Then I hang on the hotspots and try to change the code to let the compiler do a better job.

If the code is computational and contains a lot of loops, the vectorization report in the Intel compiler is very helpful - look for 'vec-report' in the help.

So the main idea: polish the performance-critical code. As for the rest, priority goes to being correct and maintainable: short functions, clear code that can be understood a year later.

– jf.
  • You are getting close to answering the question..... what sorts of things do you do to the code, to make it possible for the compiler to do those sorts of optimizations? – EvilTeach Jan 16 '10 at 20:43
  • Trying to write more in C-style (vs. in C++) e.g. avoiding virtual functions w/o absolute need, especially if they going to be called often, avoid AddRefs.. and all cool stuff (again unless it really needed). Write code easy for inlining - fewer parameters, less "if"-s. Not use global variables unless absolute need. In data structure - put wider fields first (double, int64 goes before int) - so compiler align struct on first field natural size - aligning good for perf. – jf. Jan 18 '10 at 21:34
  • Data layout and access are absolutely critical for performance. So after profiling - i sometimes break a structure into several ones following locality of accesses. One more general trick - use int or size-t vs. char - even data values are small - avoid various perf. penalties store to load blocking, issues with partial registers stalls. of course this doesn't applicable when need big arrays of such data. – jf. Jan 18 '10 at 21:42
  • One more - avoid system calls, unless there is a real need :) - they are VERY expensive – jf. Jan 18 '10 at 21:45
  • Enabling vectorization is specific to a loop. Often vectorization could be enabled by providing compiler a hint to ignore vector dependencies - if this is a case. More on this: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_ivdep.htm – jf. Jan 18 '10 at 21:51
  • @jf: I did +1 on your answer, but please could you move the answer from comments to answer body ? It will be easier to read. – kriss Mar 21 '10 at 21:32

One optimization I have used in C++ is creating a constructor that does nothing. One must manually call init() in order to put the object into a working state.

This has benefit in the case where I need a large vector of these classes.

I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space but not actually consumed much physical memory. I avoid the page faults and the associated construction costs.

As I generate objects to fill the vector, I set them up using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
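A sketch of the pattern as described (class and field names are hypothetical; note the comments below, which point out that a typical std::vector already defers construction until push_back):

```cpp
#include <vector>

struct Record {
    int    key;
    double value;
    Record() {}                    // deliberately empty: touches no members
    void init(int k, double v) {   // real initialization, called manually
        key = k;
        value = v;
    }
};

int main() {
    std::vector<Record> records;
    records.reserve(1 << 20);      // address space only; pages not yet touched

    for (int i = 0; i < (1 << 20); ++i) {
        records.push_back(Record());      // cheap: the constructor does nothing
        records.back().init(i, i * 0.5);  // populate in place
    }
}
```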

EvilTeach
  • 28,120
  • 21
  • 85
  • 141
  • 6
    I believe a typical implementation of std::vector doesn't actually construct more objects when you reserve() more capacity. It just allocates pages. The constructors are called later, using placement new, when you actually add objects to the vector -- which is (presumably) just before you call init(), so you don't really need the separate init() function. Also remember that even if your constructor is "empty" in the source code, the compiled constructor may contain code to initialize things like virtual tables and RTTI, so the pages get touched at construction time anyway. – Wyzard Jan 17 '10 at 18:38
  • 1
Yep. In our case we use push_back for populating the vector. The objects don't have any virtual functions, so that is not a problem. The first time we tried it with a real constructor, we were astounded by the volume of page faults. I realized what had happened, and we yanked the guts of the constructor, and the page fault problem vanished. – EvilTeach Jan 18 '10 at 04:11
  • That rather surprises me. What C++ and STL implementations were you using? – David Thornley Apr 02 '10 at 15:15
  • 3
    I agree with the others, this sounds like a bad implementation of std::vector. Even if your objects did have vtables, they wouldn't be constructed until your push_back. You should be able to test this by declaring the default constructor to be private, because all vector will need is the copy-constructor for push_back. – Tom Apr 02 '10 at 16:21
  • 1
    @David - The implementation was on AIX. – EvilTeach Apr 07 '10 at 16:23
This is interesting as an anecdote, but to be clear, it was a problem caused by the implementation of `std::vector` on that machine - _not_ a piece of advice that is generally applicable, and it might be harmful (e.g. if some reader naively assumes they need to use _init_ functions everywhere ;-) – underscore_d Mar 13 '16 at 21:41
  • Nah. Pretty much all systems use virtual memory. – EvilTeach Mar 13 '16 at 22:33
1

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.

The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

David Thornley
  • 56,304
  • 9
  • 91
  • 158
0

I have long suspected, but never proved, that declaring arrays so that the number of elements is a power of two enables the optimizer to do a strength reduction, replacing a multiply with a shift by some number of bits when looking up individual elements.
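For what it's worth, here is a hedged sketch of where that substitution can fire: with a two-dimensional array, the inner dimension is the row stride, so making it a power of two lets the compiler compute element addresses with a shift instead of a multiply (though, as the comments below point out, power-of-two strides carry their own cache hazards):

```cpp
// With an inner dimension of 64, the address of table[row][col] is
// base + (row * 64 + col) * sizeof(int), and most compilers will
// turn row * 64 into row << 6.
static int table[128][64];

int lookup(int row, int col)
{
    return table[row][col];
}
```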

EvilTeach
  • 28,120
  • 21
  • 85
  • 141
  • 6
That used to be true; nowadays it isn't anymore. In fact, exactly the opposite is true. If you declare your arrays with powers of two, you will very likely run into the situation where you work on two pointers a power of two apart in memory. The problem is that CPU caches are organized just like that, and you may end up with the two arrays fighting over one cache line. You get horrible performance that way. Having one of the pointers a couple of bytes ahead (e.g. not a power of two) prevents this situation. – Nils Pipenbrinck Mar 21 '10 at 14:55
  • +1 Nils, and one specific occurrence of this is "64k aliasing" on Intel hardware. – Tom Apr 02 '10 at 16:30
  • This is something that is easily disproved by looking at the disassembly, by the way. I was amazed, years ago, at seeing how gcc would optimize all sorts of constant multiplications with shifts and adds. E.g. `val * 7` turned into what would otherwise look like `(val << 3) - val`. – dash-tom-bang Apr 09 '10 at 18:56
0

One of the things I vaguely remember from COBOL in the 80s is that there were linker options that allowed you to affect the order in which functions were linked together. This allowed you to (possibly) increase code locality.

Along that same idea, I have wondered whether a possible optimization could be achieved by using the pattern

for (some silly loop)
    if (something)
        if (something else)
            if (something else)
                if (something else)
                    /* This is the normal, expected case */
                else error 4
            else error 3
        else error 2
    else error 1

The loop header and the ifs may all fit into one cache block, which in theory could lead to faster loop execution.

I would also guess that the else branches, being similar, could be optimized to some degree.

Comments? Am I dreaming?

EvilTeach
  • 28,120
  • 21
  • 85
  • 141
0

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.
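A small sketch of the idea (the names are made up): with the helper defined above its caller, even a single-pass compiler has the body in hand when it reaches the call site:

```cpp
// Definition appears first, so the body is known at every call site below.
static inline int square(int x) { return x * x; }

int sum_of_squares(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += square(a[i]);  // inlining candidate: definition already seen
    return total;
}
```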

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
  • Really? Can you cite a rationale and examples for this? Not saying it's untrue, just it sounds unintuitive that location would matter. – underscore_d Apr 13 '16 at 19:40
  • @underscore_d it can't inline something until the function definition is known. While modern compilers might make multiple passes so that the definition is known at code generation time, I don't assume it. – Mark Ransom Apr 13 '16 at 19:43
  • I'd assumed compilers work off abstract call graphs rather than physical function order, meaning this wouldn't matter. Sure, I suppose it doesn't hurt to be extra careful - especially when, performance aside, IMO it just seems more logical to define functions that are called before those that call them. I'd have to test performance but would be surprised if it mattered, but until then, I'm open to being surprised! – underscore_d Apr 13 '16 at 20:52
-4

Let the optimizer do its job.

Seriously. Don't try to outsmart the optimizer. It was designed by brilliant people with way, way more experience than you.

Aaronaught
  • 120,909
  • 25
  • 266
  • 342
  • 2
That is not always the case. There are still cases where the compiler has insufficient data to optimize and perform the necessary transformations. Here is an example - http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/ – LiraNuna Jan 15 '10 at 20:01
  • 11
    -1 Question Nullification. The whole point is to ask people on SO who have that experience. Many of us write compilers + optimizers. – Hogan Jan 15 '10 at 20:58
@Hogan: Call a spade a spade, this is premature optimization at its finest. Absolutely none of the upvoted answers here actually answer the question directly; they all talk about things that the optimizer **doesn't optimize**, not how to make things "easier" for the optimizer. – Aaronaught Jan 15 '10 at 21:51
  • @Aaronaught: I think my answer does address that. In theory any recursive routine can be made flat and thus optimized. It is easy for compilers to optimize true or close-to-true tail end recursion. This might change how you write code to help the optimizer. – Hogan Jan 15 '10 at 22:23
  • 1
    @Hogan, I frankly do not understand your fascination with tail recursion, and simply saying that compilers can optimize it isn't actually an answer to this question. Compilers can inline functions too and do a lot of other things; so what? Rarely will that affect how you actually need to write the code, and almost never will it translate into a real-world performance improvement. Why rely on the compiler to transform your tail recursion into a loop when some compilers might not do it? Just write a loop in the first place! – Aaronaught Jan 15 '10 at 22:46
  • @Aaronaught - I don't understand it either. Mostly it is the only answer to this question I could think of -- I know lots of ways to speed up / optimize programs, but making your code TER friendly is the only one I can think of that addresses the question, as you pointed out. – Hogan Jan 16 '10 at 03:24
@Aaronaught Hogan directly addressed the question. Some programmers will read his answer, ask "What is tail recursion?", read up on it, and maybe learn something. It is a win-win answer. – EvilTeach Jan 16 '10 at 20:56
@Aaronaught - You sound like you're advocating ignorance of optimization techniques to avoid related bad habits. That's a terrible trend in the community that leads to retardation of new members. But then again, I suppose that does make those of us with the knowledge more valuable, so at least it'll keep my salary high. – Tom Apr 02 '10 at 16:37
  • @Tom: You seem to be confusing code/design performance optimizations with compiler optimizations. The subset of programs whose performance characteristics are CPU-bound is vanishingly small; if you write I/O bound programs and waste hours making these types of micro-optimizations, you don't deserve that high salary. The only "retardation" I see here is reliance on penny-wise pound-foolish compiler tricks. I invite you to show me a single, modern, well-designed program other than an operating system kernel whose performance is significantly improved by such things. – Aaronaught Apr 02 '10 at 16:57
  • 1
    OK, here's an example: Programs that rely on SIMD (e.g. graphics/rendering/image processing) tend to require aligned data structures in order to avoid copying data all over the place. This is definitely in the realm of compiler optimizations, as it has to be done with compiler-specific instructions (__attribute__((aligned(x))) for GCC). – Tom Apr 05 '10 at 00:28
  • @Aaronaught - Almost every program I've ever written is CPU bound. I have done database work, and have an appreciation for it, but I feel sorry for anyone who has never ventured out of that place. You're suffering from "all my experience is X, therefore everyone else's must also be X" syndrome. – phkahler Apr 07 '10 at 17:06
  • @phkahler: You're incorrect, I've written CPU bound code as well. These kinds of cheap tricks rarely make a difference. Remember, this question is not asking about code optimization, compiler switches, that sort of thing, it's asking about how you'd rely on the *implementation details* of a particular optimizer in order to shave nanoseconds off the execution time. – Aaronaught Apr 07 '10 at 17:22
  • 1
    I think the simple fact is that many solutions can be made faster by telling the computer to operate in the most efficient way. A good knowledge of what the compiler will do can help that, but the compiler won't be able to optimize everything. -- More importantly, imo, is that it is easier for a compiler to optimize simple code. I suspect the downvotes come in reaction to "don't try to outsmart..." Nobody's advocating tricking it, folks are saying, "work with the compiler." – dash-tom-bang Apr 09 '10 at 18:58
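To make the tail-recursion point from this thread concrete, here is a hedged sketch (hypothetical function; the second version is roughly what a tail-call-optimizing compiler produces from the first):

```cpp
// Tail-recursive: the recursive call is the very last action, so an
// optimizer can reuse the current stack frame instead of growing the stack.
long sum_to(long n, long acc)
{
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);  // tail call: nothing left to do after it
}

// The equivalent loop the optimizer can reduce it to:
long sum_to_loop(long n)
{
    long acc = 0;
    for (; n > 0; --n)
        acc += n;
    return acc;
}
```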