In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

Question

Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

_{Is it worth optimizing it [..]}

Could there be a case where this could yield more efficient results by writing:

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

... or is this trivial for the compiler to optimize?

_{What would you consider to be "better" code?}

_{Note from editor (Ike): for those wondering about the strikeout text, the original question, as phrased, was dangerously close to off-topic territory and was very close to being closed in spite of positive feedback. These have been stricken out. Yet please do not punish the answerers who addressed these stricken sections of the question.}

If `*p` is of the same type as `width` then it's not trivial to optimize, since `p` could point to `width` and modify it inside the loop. — Emil Laine, Nov 24 '15 at 16:17
Asking about whether the compiler optimizes a particular operation is usually the wrong question. What you're (usually) ultimately interested in is which version runs faster, which you should simply measure. — SirGuy, Nov 24 '15 at 16:20
@GuyGreer I do agree, though I'd say the question is good, or at least interesting, thought unfortunately the answer is "you gotta measure it, per use-case". The reason is that functionality is portable but performance is not. So it actually depends on every part of the build process, starting on the compiler and finishing at the target site (os/hardware combination). And of course best guess is that the compiler is smarter than the human at this. — luk32, Nov 24 '15 at 16:29
If I was a compiler, I would see that your two examples aren't the same. It's possible that `p` points to the same memory as `bitmap->width`. Therefore I can't legally optimize the first example to the second one. — Mysticial, Nov 24 '15 at 16:32
@luk32 You can see my answer for more of my thoughts on this, I like these questions in terms of the glimpses they give into how compilers work. I just want to make sure the OP understands that there are good and bad reasons for asking these types of questions. — SirGuy, Nov 24 '15 at 16:36
Where is "p" stored? I would suggest that you might get a really huge performance win by doing something like "char * restrict p2 = p;" and then using "p2" rather than "p" within your loop. Then, if you want the changes to "p2" applied back to p, use "p += (p2-p);". Note that no pointer written within p2's lifetime by a pointer not copied form p2 may be read using a pointer copied from p2, nor vice versa, and no copy of p2 may be used for any purpose after p2's lifetime, but a compiler can use those facts to enable optimizations which cannot be accomplished via any other means. — supercat, Nov 24 '15 at 19:34
@supercat: Weird, I've tried with `__restrict__`, and the results were the same. See my edited answer. — Yaron Cohen-Tal, Nov 24 '15 at 21:11
@YaronCohen-Tal: Are you saying the results with \_\_restrict\_\_ were the same as the cached version or the uncached version? Is "bitmap" coming from a place that gcc knows nothing about? I'd suggest storing it to a volatile variable and reading it back to ensure that it is. A good compiler should be able to perform the optimization in the presence of "restrict" but not otherwise. — supercat, Nov 24 '15 at 21:37
There are different ideas of *better* code. Considering loop it self, simplify minimizing branching is better code, so as fusing loop with others and code motion of code not dependent on induction variable In nutshell compiler takes liberty of doing optimizations that serve best to architecture of machine. — Mahesh Attarde, Nov 25 '15 at 05:59
This question is being discussed on meta: http://meta.stackoverflow.com/questions/311122/are-compiler-optimization-questions-appropriate-for-so — , Nov 25 '15 at 18:52
If you can do the lifting, why not do it? There's a tendency to assume we should be lazy coders (in the name of "readability") just because the general mantra is that compiler optimizers are pretty good. Well, if I'm a compiler optimizer-writer, I know I can't possibly understand the code as well as you do, so don't expect magic from me. You do what you can, and then I will do what I can. — Mike Dunlavey, Nov 25 '15 at 21:42
Although the aliasing problem is the interesting part of all this and the answers should reflect that, I do want to point out that from a practical perspective, caching a simple cast is not going to save enough time to to be worth it (unless it's in the middle of a *very* critical loop), but if it's something like `strlen(foo)` or another call to a pure function, it almost certainly will be worth caching the result, since the compiler will *not* be likely to realize that those could be optimized, and they take a significant amount of time per call. — Ray, Nov 25 '15 at 21:51
@Ray The cast costs nothing typically. The missed optimization opportunities resulting from aliasing can cost quite a bit in an image processing context where, by nature, the loops are very critical (millions of pixels to process, e.g.). Of course avoiding redundant `strlen` calls and stuff like that will tend to help in more ordinary kind of code where it might not be so loopy and more about avoiding tiny efficiencies. — , Nov 25 '15 at 22:19
Ike, we're in agreement. I may have phrased my comment poorly, but my key point was that the interesting part of the problem is that width potentially aliases one of the values being pointed at, and so the cast can't be optimized away, but that in this particular case, eliminating the cast saves at most one memory lookup per iteration, so it's not worth it (especially since width will likely remain in L1 or L2 for the duration). But that it *would* be worth it if either the repeated expression was more expensive *or* the loop was tight enough that saving a single instruction would pay off. — Ray, Nov 25 '15 at 22:44
@ike for what it's worth, in the image processing that I do (for remote sensing), we talk about images that are hundreds of millions of pixels and we can have anywhere from tens to thousands of images. For us a typical data set is a couple of terabytes, to give some perspective on scope. — SirGuy, Nov 26 '15 at 05:31
I think it's worth pointing out that if you have two pointers and the compiler can determine that the range of a loop does not change then the compiler can generate code to check if the pointers overlap within the range and generate two branches (e.g. one branch does `memcpy` and one branch does`memmove`). This is what GCC does e.g. with auto-vectorization so using `restrict` only reduces the amount code and the check and in practice does not help much. However, for the OPs question the range of the loop can change so this does not apply (which is why this is a comment and not an answer). — Z boson, Dec 03 '15 at 09:31

score 82 · Answer 1 · edited May 23 '17 at 10:30

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags activated. When I checked it, I was surprised to see the result:

Source `unoptimized.cpp`

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Source `optimized.cpp`

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    const unsigned width = static_cast<unsigned>(bitmap.width);
    for (unsigned x = 0 ; x < width ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Compilation

$ g++ -s -O3 unoptimized.cpp
$ g++ -s -O3 optimized.cpp

Assembly (unoptimized.s)

    .file   "unoptimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    mov %eax, %edx
    addl    $1, %eax
    movq    (%rsi,%rdx,8), %rdx
    movb    $0, (%rdx)
    cmpl    bitmap(%rip), %eax
    jb  .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

Assembly (optimized.s)

    .file   "optimized.cpp"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
    subl    $1, %eax
    leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    movq    (%rsi,%rax), %rdx
    addq    $8, %rax
    cmpq    %rcx, %rax
    movb    $0, (%rdx)
    jne .L3
.L2:
    xorl    %eax, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
.globl bitmap
    .bss
    .align 8
    .type   bitmap, @object
    .size   bitmap, 8
bitmap:
    .zero   8
    .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
    .section    .note.GNU-stack,"",@progbits

diff

$ diff -uN unoptimized.s optimized.s
--- unoptimized.s   2015-11-24 16:11:55.837922223 +0000
+++ optimized.s 2015-11-24 16:12:02.628922941 +0000
@@ -1,4 +1,4 @@
-   .file   "unoptimized.cpp"
+   .file   "optimized.cpp"
    .text
    .p2align 4,,15
 .globl main
@@ -10,16 +10,17 @@
    movl    bitmap(%rip), %eax
    testl   %eax, %eax
    je  .L2
+   subl    $1, %eax
+   leaq    8(,%rax,8), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
 .L3:
-   mov %eax, %edx
-   addl    $1, %eax
-   movq    (%rsi,%rdx,8), %rdx
+   movq    (%rsi,%rax), %rdx
+   addq    $8, %rax
+   cmpq    %rcx, %rax
    movb    $0, (%rdx)
-   cmpl    bitmap(%rip), %eax
-   jb  .L3
+   jne .L3
 .L2:
    xorl    %eax, %eax
    ret

The generated assembly for the optimized version does actually load (lea) the width constant unlike the unoptimized version which computes the width offset at each iteration (movq).

When I'll get time, I eventually post some benchmark on that. Good question.

It would be interesting to see if the code was generated differently if you cast to `const unsigned` instead of just `unsigned` in the unoptimized case. — Mark Ransom, Nov 24 '15 at 17:32
Will do. I'm accepting requests if someone else have an idea. I'll try to post an analyse in the next days. — YSC, Nov 24 '15 at 17:34
On version `5.2.0` of gcc, the output is somewhat different with both `-O3` and `-Ofast` flags. I don't understand this syntax(I guess this is not the one Intel uses) so it would be nice if you could have a look at this. Also if possible someone could look at what `6.x` does. — Registered User, Nov 25 '15 at 05:51
Also in the commands that you have given, you probably meant `-S` instead of `-s` i.e capital S — Registered User, Nov 25 '15 at 05:52
@MarkRansom I guess it shouldn't make a difference: The "promise" of being const is only during the single comparison, not for the whole loop — Hagen von Eitzen, Nov 25 '15 at 10:57
I think that was a flawed test because the alias behavior of `char` is special. You should have tested with `unsigned char` — JSF, Nov 25 '15 at 14:15
@JSF The special rule of `char` [also applies to `unsigned char`](http://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_aliasing) — jingyu9575, Nov 26 '15 at 11:31
Please **NEVER** use the function `main` to test for an optimization. Gcc purposefully marks it as cold and thus disables some optimizations for it. I don't know if that's the case here, but that's an important habit to get into. — Marc Glisse, Nov 26 '15 at 15:47
@MarcGlisse You're 100% right. I've wrote it in a hurry, I'll improve that. — YSC, Nov 26 '15 at 15:51
Here's a link to [both functions in one compilation unit on godbolt](http://goo.gl/9NiWOS), assuming `bitmap` is a global. The non-CSEd version uses a memory-operand to `cmp`, which isn't a problem for perf in this case. If it was a local, the compiler could assume other pointers couldn't "know about" it and point into it. It's not a bad idea to store expressions involving globals in temp variables, as long as it improves (or doesn't hurt) readability, or if performance is critical. Unless there's a lot going on, such locals can usually just live in registers, and never be spilled. — Peter Cordes, Nov 27 '15 at 09:19
As a notice, I've bountied this question yet again. This one has a lot of potential. — , Dec 18 '15 at 18:11
@VioletGiraffe https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/predict.c;h=d5c40def863f0c87b6ff4aa7e998a9985dc6045e;hb=HEAD#l2953 — Marc Glisse, Dec 23 '15 at 22:20
@MarcGlisse: interesting, but it's unclear what the effects of `NODE_FREQUENCY_EXECUTED_ONCE` are. — Violet Giraffe, Dec 24 '15 at 06:10

score 38 · Answer 2 · edited Nov 25 '15 at 16:12

38

There is actually insufficient information from your code snippet to be able to tell, and the one thing that I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that and (because p is of type char*) the compiler has to make this code work even if p and bitmap overlap.

This means in this case that if the loop changes bitmap->width through the pointer p then that has to be seen when re-reading bitmap->width later on, which in turn means that storing it in a local variable would be illegal.

That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but never directly sought out information on what the compiler is doing in this case), and quickly check if the pointers alias and run the faster code if it determines it's okay to.

That being said, I stand by my comment about simply measuring the performance of the two versions, my money is on not seeing any consistent performance difference between the two versions of the code.

In my opinion, questions like these are okay if your purpose is to learn about compiler optimization theories and techniques, but is a waste of time (a useless micro-optimization) if your end goal here is to make the program run faster.

edited Nov 25 '15 at 16:12

user

6,897
8
43
79

answered Nov 24 '15 at 16:33

SirGuy

10,660
2
36
66

@Mysticial From what I understand, it is the number one optimization blocker in practise. I do wish I knew more about the performance implications about it and just what compilers can/can't do about it. – SirGuy Nov 24 '15 at 16:38
1

@GuyGreer: It is a major optimization-blocker; I consider it unfortunate that the language rules focus on rules about effective types, rather than identifying situations where writes and reads of different items are or are not unsequenced. Rules written in such term could do a much better job of meeting compiler and programmer needs than the present ones. – supercat Nov 24 '15 at 19:39
3

@GuyGreer -- wouldn't a `restrict` qualifier be the answer to the aliasing problem in this case? – LThode Nov 24 '15 at 20:56
@LThode `restrict` is not standard C++ but is offered by gcc and msvc with some differences in where they can appear. I'm tempted to say that storing the width in a local variable is better for readability, but if there were other variables that could alias, it could end up being a lot of local variables instead of a few `restrict` declarations. – SirGuy Nov 24 '15 at 21:26
4

In my experience, `restrict` is largely hit-and-miss. MSVC is the only compiler I've seen that seems to do it properly. ICC loses aliasing info through function calls even if they're inlined. And GCC usually fails to get any benefit unless you declare every single input parameter as `restrict` (including `this` for member functions). – Mysticial Nov 24 '15 at 22:05
1

@Mystical: One thing to remember is that `char` aliases all types, so if you have a char* then you have to use `restrict` on everything. Or if you have forced GCC's strict aliasing rules off with `-fno-strict-aliasing` then everything is considered a possible alias. – Zan Lynx Nov 25 '15 at 00:09
@GuyGreer -- that sounds...awfully gratuitous (in the sense that its a gratuitous incompatibility between C and C++) – LThode Nov 25 '15 at 15:31
LThode, restrict got added to C after C++ was created. So the C++ committee didn't create an incompatability by leaving it out; they just haven't gotten around to adding it in yet. It is a little odd that they didn't put it in C++11, though; the aliasing problem is Hard without it. – Ray Nov 25 '15 at 21:39
1

@Ray The most recent proposal for `restrict`-like semantics in C++ is [N4150](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4150.pdf). – T.C. Nov 26 '15 at 05:05

Yaron Cohen-Tal · Answer 3 · 2015-11-25T08:48:30.063

24

Ok, guys, so I've measured, with GCC -O3 (using GCC 4.9 on Linux x64).

Turns out, the second version runs 54% faster!

So, I guess aliasing is the thing, I hadn't thought about it.

[Edit]

I've tried again the first version with all pointers defined with __restrict__, and the results are the same. Weird.. Either aliasing is not the problem, or, for some reason, the compiler doesn't optimize it well even with __restrict__.

[Edit 2]

Ok, I think I was pretty much able to prove that aliasing is the problem. I repeated my original test, this time using an array rather than a pointer:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

And measured (had to use "-mcmodel=large" to link it). Then I tried:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

The measure results were the same - Seems like the compiler was able to optimize it by itself.

Then I tried the original codes (with a pointer p), this time when p is of type std::uint16_t*. Again, the results were the same - due to strict aliasing. Then I tried building with "-fno-strict-aliasing", and again saw a difference in time.

edited Nov 25 '15 at 08:48

answered Nov 24 '15 at 17:08

Yaron Cohen-Tal

2,005
1
15
26

4

This seems like it should be a comment, though it does technically answer the question. Also note, unfortunately you haven't demonstrated that aliasing was the thing. It seems likely, certainly plausible, but that's different than concluding that that was it. – SirGuy Nov 24 '15 at 18:14
@GuyGreer: See my [edit 2] - now I think it's pretty much proven. – Yaron Cohen-Tal Nov 25 '15 at 08:49
2

I just wonder why you started using the variable "i" the when you have "x" in your loop? – Jesper Madsen Nov 25 '15 at 21:37
1

Is it just me that finds the phrase *54% faster* difficult to comprehend? Do you mean it's 1.54 times the speed of the unoptimized one, or something else? – Roddy Nov 26 '15 at 17:15
@Roddy: It means it takes it 54% less time to run. e.g., if the first one takes 100 secs, the second one takes 46 secs. – Yaron Cohen-Tal Nov 26 '15 at 17:20
3

@YaronCohen-Tal so over twice as fast? Impressive, but not what I'd have understood "54% faster" to mean! – Roddy Nov 26 '15 at 20:30

plugwash · Answer 4 · 2015-11-25T18:23:44.360

24

Other answers have pointed out that hoisting the pointer operation out of the loop may change defined behaviour due to aliasing rules that allow char to alias anything and hence is not an allowable optimisation for a compiler even though in most cases it is obviously correct to a human programmer.

They have also pointed out that hoisting the operation out of the loop is usually but not always an improvement from a performance point of view and is often a negative from a readability point of view.

I would like to point out that there is often a "third way". Rather than counting up to the number of iterations you want you can count down to zero. This means that the number of iterations is only needed once at the start of the loop, it doesn't have to be stored after that. Better still at the assembler level it often eliminates the need for an explicit comparison as the decrement operation will usually set flags that indicate whether the counter was zero both before (carry flag) and after (zero flag) the decrement.

for (unsigned x = static_cast<unsigned>(bitmap->width);x > 0;  x--)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Note that this version of the loop gives x values in the range 1..width rather than the range 0..(width-1). That doesn't matter in your case because you aren't actually using x for anything but it's something to be aware of. If you want a count down loop with x values in the range 0..(width-1) you can do.

for (unsigned x = static_cast<unsigned>(bitmap->width); x-- > 0;)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

You can also get rid of the casts in the above examples if you want without worrying about it's impact on comparison rules since all you are doing with bitmap->width is assigning it directly to a variable.

edited Nov 25 '15 at 18:23

answered Nov 25 '15 at 00:53

plugwash

9,724
2
38
51

2

I've seen that second case formatted as `x --> 0`, resulting in the "downto" operator. Pretty funny. P.S. I don't consider making a variable for the end condition to be a negative for readability, it can actually be the opposite. – Mark Ransom Nov 25 '15 at 13:02
It really depends, sometimes a statement gets so horrible that breaking it into multiple statements improves readability but I don't belive that to be the case here. – plugwash Nov 25 '15 at 13:27
1

+1 Good observation, though I would argue that hoisting the `static_cast(bitmap->width)` and using `width` instead in the loop is actually an improvement for readability because now there are less things for the reader to parse per line. Others' views may differ though. – SirGuy Nov 25 '15 at 21:11
1

There are many other situations where counting downwards is superior (e.g. when removing items from a list). I don't know why this isn't done more often. – Ian Goldby Nov 26 '15 at 08:39
3

If you want to write loops that look more like the optimal asm, use `do { } while()`, because in ASM you make loops with a conditional branch at the end. The usual `for(){}` and `while(){}` loops require extra instructions to test the loop condition once before the loop, if the compiler can't prove it always runs at least once. By all means, use `for()` or `while()` when it's useful to check if the loop should even run once, or when it's more readable. – Peter Cordes Nov 27 '15 at 08:59
@IanGoldby hmm, I would think that removing items from a list is a situation where it is better to count up. That way you only have to move each item once. – plugwash Dec 03 '15 at 15:21
@plugwash Counting downwards through a list means that you don't have to adjust the loop index each time you delete an item. In some languages this is important because you can't write to a loop index. In all cases it makes for simpler code, albeit not necessarily the most efficient. My point is that I don't agree that counting down makes code less readable. – Ian Goldby Dec 03 '15 at 15:44

score 11 · Answer 5 · edited May 23 '17 at 12:02

The only thing here that can prevent the optimization is the strict aliasing rule. In short:

"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"

[…]

The exception to the rule is a char*, which is allowed to point to any type.

The exception also applies to unsigned and signed char pointers.

This is the case in your code: You're modifying *p through p which is an unsigned char*, so the compiler must assume that it could point to bitmap->width. Hence the caching of bitmap->width is an invalid optimization. This optimization-preventing behavior is shown in YSC's answer.

If and only if p pointed to a non-char and non-decltype(bitmap->width) type, would the caching be a possible optimization.

Roddy · Answer 6 · 2015-12-08T12:47:14.120

The question originally asked:

Is it worth optimizing it?

And my answer to that (garnering a good mix of both up and down votes..)

Let the compiler worry about it.

The compiler will almost certainly do a better job than you. And there's no guarantee that your 'optimization' is any better than the 'obvious' code - have you measured it??

More importantly, have you any proof that the code you're optimizing has any impact on the performance of your program?

Despite the downvotes (and now seeing the aliasing issue), I'm still happy with that as a valid answer. If you don't know if it's worth optimizing something, it probably isn't.

A rather different question, of course, would be this:

How can I tell if it's worth optimizing a fragment of code?

First, does your application or library need to run faster than it currently does? Is the user kept waiting too long? Does your software forecast yesterday's weather instead of tomorrow's?

Only you can really tell this, based on what your software is for and what your users expect.

Assuming your software does need some optimzation, the next thing to do is start measuring. Profilers will tell you where your code spends it's time. If your fragment isn't showing as a bottleneck, it's best left alone. Profilers and other measuring tools will also tell you if your changes have made a difference. It's possible to spend hours attemtping to optimize code, only to find you've made no discernible difference.

What do you mean by 'optimizing', anyway?

If you're not writing 'optimized' code, than your code should be as clear, clean and concise as you can make it. The "Premature optimization is evil" argument isn't an excuse for sloppy or inefficient code.

Optimized code normally sacrifices some of the attributes above for performance. It could involve introducing additional local variables, having objects with wider than expected scope or even reversing normal loop ordering. All of these may be less clear or concise, so document the code (briefly!) about why you're doing this.

But often, with 'slow' code, these micro-optimizations are the last resort. The first place to look is at algorithms and data structures. Is there a way of avoiding doing the work at all? Can linear searches be replaced with binary ones? Would a linked list be faster here than a vector? Or a hash table? Can I cache results? Making good 'efficient' decisions here can often affect performance by an order of magnitude or more!

When you're iterating over the width of a bitmap image, the looping logic can be a significant portion of the time spent in the loop. Rather than worrying about premature optimization, it's better in this case to develop best practices that are efficient from the start. — Mark Ransom, Nov 24 '15 at 16:16
@MarkRansom agreed, in part: But "best practices" would either be a: use an existing library or API call for filling images, or b: get the GPU to do it for you. It should never be the kind of unmeasured micro-optimization the OP suggests. And how do you know this code is ever executed more than once, or with bitmaps bigger then 16 pixels wide...? — Roddy, Nov 24 '15 at 16:27
@Veedrac. Appreciate the justification for the -1. The question's thrust has subtly and substantially changed since I answered. If you think the (expanded) answer is still unhelpful, time for me to delete it... "Is it worth..." is always primarily opinion-based, anyway. — Roddy, Dec 08 '15 at 12:50
@Roddy I appreciate the edits, they do help (and my comment probably sounded way too harsh anyway). I'm still on-the-fence though, since this is really an answer to a question that isn't suitable for Stack Overflow. It feels like a proper answer would be specific to the snippet, as the highly-voted answers here are. — Veedrac, Dec 08 '15 at 12:59

0kcats · Answer 7 · 2015-12-18T23:06:35.213

I use the following pattern in the situation like this. It is almost as short as the first case of yours, and is better than the second case, because it keeps the temporary variable local to the loop.

for (unsigned int x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
  *p++ = 0xAA;
  *p++ = 0xBB;
  *p++ = 0xCC;
}

This will be faster with a less than smart compiler, debug build, or certain compilation flags.

Edit1: Placing a constant operation outside of a loop is a good programming pattern. It shows understanding of basics of machine operation, especially in C/C++. I'd argue that the effort to prove themselves should be on people that do not follow this practice. If compiler punishes for a good pattern, it is a bug in the compiler.

Edit2:: I've measured my suggestion against original code on vs2013, got %1 improvement. Can we do better? A simple manual optimization gives 3 times improvement over the original loop on x64 machine without resorting to exotic instructions. The code below assumes little endian system and properly aligned bitmap. TEST 0 is original (9 sec), TEST 1 is faster (3 sec). I bet somebody could make this even faster, and the result of the test would depend on the size of the bitmap. Definitely soon in the future, compiler will be able to produce consistently fastest code. I afraid this will be the future when the compiler will be also a programmer AI, so we would be out of work. But for now, just write code that shows that you know that extra operation in the loop is not needed.

#include <memory>
#include <time.h>

struct Bitmap_line
{
  int blah;
  unsigned int width;
  Bitmap_line(unsigned int w)
  {
    blah = 0;
    width = w;
  }
};

#define TEST 0 //define 1 for faster test

int main(int argc, char* argv[])
{
  unsigned int size = (4 * 1024 * 1024) / 3 * 3; //makes it divisible by 3
  unsigned char* pointer = (unsigned char*)malloc(size);
  memset(pointer, 0, size);
  std::unique_ptr<Bitmap_line> bitmap(new Bitmap_line(size / 3));
  clock_t told = clock();
#if TEST == 0
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
    //for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#else
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    unsigned x = 0;
    for (const unsigned n = static_cast<unsigned>(bitmap->width) - 4; x < n; x += 4)
    {
      *(int64_t*)p = 0xBBAACCBBAACCBBAALL;
      p += 8;
      *(int32_t*)p = 0xCCBBAACC;
      p += 4;
    }

    for (const unsigned n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#endif
  double ms = 1000.0 * double(clock() - told) / CLOCKS_PER_SEC;
  printf("time %0.3f\n", ms);

  {
    //verify
    unsigned char* p = pointer;
    for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      if ((*p++ != 0xAA) || (*p++ != 0xBB) || (*p++ != 0xCC))
      {
        printf("EEEEEEEEEEEEERRRRORRRR!!!\n");
        abort();
      }
    }
  }

  return 0;
}

You can save another 25% on 64bit if You use three int64_t instead of int64_t and int32_t. — Antonín Lejsek, Dec 20 '15 at 01:59

score 5 · Answer 8 · answered Nov 24 '15 at 16:23

There are two things to consider.

A) How often will the optimization run?

If the answer is not very often, like only when a user clicks a button, then don't bother if it makes your code unreadable. If the answer is 1000 times a second then you will probably want to go with the optimization. If it is even a bit complex be sure to put a comment in to explain what is going on to help the next guy that comes along.

B) Will this make the code harder to upkeep/troubleshoot?

If you're not seeing a huge gain in performance then making your code cryptic simply to save a few clock ticks is not a good idea. Lots of people will tell you that any good programmer should be able to look at the code and figure out what is going on. This is true. The problem is that in the business world the extra time figuring that out costs money. So, if you can make it prettier to read then do it. Your friends will thank you for it.

That said I'd personally use the B example.

score 4 · Answer 9 · answered Nov 24 '15 at 16:06

4

The compiler is able to optimize a lot of things. For your example, you should go for the readability, mantainability and what follows your code standard. For more information about what can be optimized (with GCC), see this blog post.

answered Nov 24 '15 at 16:06

Guillaume Racicot

39,621
9
77
141

score 4 · Answer 10 · answered Nov 24 '15 at 20:04

4

As a general rule, let the compiler do the optimization for you, until you determine that you should take over. The logic for this has nothing to do with performance, but rather with human readability. In the vast majority of cases, the readability of your program is more important than its performance. You should aim to write code which is easier for a human to read, and then only worry about optimization when you are convinced that performance is more important than the maintainability of your code.

Once you do see that performance matters, you should run a profiler on the code to determine which loops are being inefficient, and optimize those individually. There may indeed be cases where you want to do that optimization (especially if you migrate towards C++, where STL containers get involved), but the cost in terms of readability is great.

In addition, I can think of pathological situations where it could actually slow the code down. For example, consider the case where the compiler could not prove that bitmap->width was constant through the process. By adding the width variable you force the compiler to maintain a local variable in that scope. If, for some platform specific reason, that extra variable prevented some stack-space optimization, it may have to reorganize how it is emitting bytecodes, and produce something less efficient.

As an example, on Windows x64, one is obliged to call a special API call, __chkstk in the preamble of the function if the function will use more than 1 page of local variables. This function gives windows a chance to manage the guard pages they use to expand the stack when needed. If your extra variable pushes the stack usage up from below 1 page to at-or-above 1 page, your function is now obliged to call __chkstk every time it is entered. If you were to optimize this loop on a slow path, you may actually slow the fast path down more than you saved on the slow path!

Sure, it's a bit pathological, but the point of that example is that you can actually slow the compiler down. It just shows that you do have to profile your work to determine where the optimizations go. In the mean time, please don't sacrifice readability in any way for an optimization that may or may not matter.

answered Nov 24 '15 at 20:04

Cort Ammon

10,221
31
45

4

I wish C and C++ would provide more ways of explicitly identifying things the programmer doesn't care about. Not only would they provide more chances for compilers to optimize things, but it would save other programmers who read the code from having to guess whether e.g. it might be rechecking bitmap->width every time to ensure that changes to it affect the loop, or whether it might be caching bitmap->width to ensure that changes to it don't affect the loop. Having a means of saying "Cache this or not--I don't care" would make it clear the reason for the programmer's choice. – supercat Nov 24 '15 at 21:51
@supercat I agree wholeheartedly, as one can see if one looks at the piles of tatted failed languages I've sought to write to resolve this. I have found it is remarkably hard to define "what" someone doesn't care about without so much ungodly syntax that it just isn't worth it. I continue my search in vain. – Cort Ammon Nov 24 '15 at 21:55
It's not possible to define it in all cases, but I think there are a lot of cases where the type system could help. It's too C decided to make character types the "universal accessor" rather than having a type qualifier that was a bit looser than "volatile" that could be applied to *any* type, with the semantics that accesses of such types would be processed in sequence with accesses of the non-qualified equivalent type and also with accesses of all types of variables with that same qualifier. That would help make clear whether one was using character types because one needed the... – supercat Nov 24 '15 at 22:20
...aliasing behavior, or whether one was using them because they were the right size to fit one's needs. It would also be helpful to have explciit aliasing barriers which could in many cases be placed outside of loops, unlike the implicit barriers associated with character-type accesses. – supercat Nov 24 '15 at 22:20
1

This is a wise talk, but, generally, if you do already select C for your task, probably the performance is very important and the different rules should apply. Otherwise may be better to use Ruby, Java, Python or something the like. – Audrius Meškauskas Nov 25 '15 at 10:41
The compiler TRIES to do the optimization for you but you should not rely on it. Simple things like const int X = 25 are optimized without fail. However, if you're trying to solve the traveling salesman problem in linear time then I wouldn't count on the compiler to optimize anything you're doing. – soulsabr Nov 25 '15 at 14:39
@soulsabr Correct. Which is why, when you profile your work, and discover that 80% of your CPU time is spent in one small section, you hand-optimize that part. You might even drop down to hand-coded assembly if need be. However, in my experience there are very few sections of code in which performance is more important than readability. Loss of readability is very expensive. – Cort Ammon Nov 25 '15 at 18:35
Your pathological example is *really* unlikely, as you probably know. Creating a local variable to hold a temporary result for a small section of code usually has no cost. Usually the compiler can keep it in a register, and never needs to make space on the stack for it since it's never spilled. You're right that spilling it to the stack instead of just later re-loading from a pointer (or a global) makes for bigger and slower code, esp. if it has to sub/add from the stack pointer when it wouldn't otherwise in a function (really unlikely, threshold of spilling the first reg). – Peter Cordes Nov 27 '15 at 09:06
Your advice is good when one has to build fast and sell the code, and never go back to it, or charge money for each improvement, or run the code only a few times. If you write the code that you have to maintain and show to other developers, it is better to spend a little time on making it clean and show some care. – 0kcats Dec 18 '15 at 23:25
@0kcats Can you explain what you mean? My " In the vast majority of cases, the readability of your program is more important than its performance. You should aim to write code which is easier for a human to read, and then only worry about optimization when you are convinced that performance is more important than the maintainability of your code." seems to line up exactly with your "making it clean and show some care." – Cort Ammon Dec 18 '15 at 23:48
@Cort Ammon If a dev recognizes an operation, that can be moved out of a loop, they should do that as a habit. Even if it makes it better by 1%. In this example, the code has not become less readable, but you seem to advice against a better choice (let even hypothetically better). I am not advocating of micro optimizations, but for obtaining and maintaining good habits. If bitmap->width were a function call (it actually could be), and if you move it out, you would not need to care and measure this loop ever just because you got a habit of doing the right choice. – 0kcats Dec 19 '15 at 03:21
@0kcats Ahh, you'll note I took great care *not* to claim "moving things out of the loop makes it less readable," simply that you should be far more concerned with readability than optimization. Readability is a very complicated topic that includes not only personal preference but the preferences of the company you work for. The OP was concerned with doing that operation for optimization reasons, not readability. It's up to each group of developers to determine what readable is. Doing it for the sake of optimization is what i'm opposed to. – Cort Ammon Dec 19 '15 at 03:25
I've seen code where those "easily readable" changes grew over time and actually obscured a bug that took quite a long time to fix because it involved too much syntax so the eye glosses over it. If you're arguing they are more readable, great, do it! But don't do it just for optimization sake. (The particular bug I remember was one which dealt with optimizing end iterators by calling end() once at the start of a loop, rather than every loop. Someone misplaced an iterator along the way and made an invalid assumption about the state of this "endIter" variable that was getting reused) – Cort Ammon Dec 19 '15 at 03:28
@0kcats Then consider yourself lucky not not have the misfortune of having to find most bugs in hard to read code. It's really taxing. – Cort Ammon Dec 19 '15 at 03:46

score 4 · Answer 11 · edited May 23 '17 at 11:47

The comparison is wrong since the two code snippets

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)

and

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x<width ;  ++x)

are not equivalent

In the first case width is dependent and not const, and one cannot assume that it may not change between subsequent iterations. Thus it cannot be optimized, but has to be checked at every loop.

In your optimized case a local variable is assigned the value of bitmap->width at some point during program execution. The compiler can verify that this does not actually change.

Did you think about multi threading , or maybe the value could be externally dependent such that its value is volatile. How would one expect the compiler to figure all these things out if you do not tell?

The compiler can only do as good as your code lets it.

Vinayak S M · Answer 12 · 2015-12-03T11:33:01.187

2

Unless you know how exactly the compiler optimizes the code, it is better to do your own optimizations by keeping the code readability, and design. Practically it is hard to check assembly code for every function we write for new compiler versions.

edited Dec 03 '15 at 11:33

answered Dec 03 '15 at 11:27

Vinayak S M

49
4

score 1 · Answer 13 · answered Dec 21 '15 at 14:57

Compiler cannot optimize bitmap->width because value of width can be changed between iterations. There are a few most common reasons:

Multi-threading. Compiler cannot predict if other thread is about to change value.
Modification inside loop, sometimes it is not simple to tell if variable will be changed inside loop.
It is function call, e.g. iterator::end() or container::size() so it is hard to predict if it will always returns the same result.

To sum up (my personal opinion) for places that requires high level of optimization you need to do that by yourself, in other places just leave it, compiler may optimize it or not, if there is no big difference code readability is main target.

In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

13 Answers13

Source `unoptimized.cpp`

Source `optimized.cpp`

Compilation

Assembly (unoptimized.s)

Assembly (optimized.s)

diff

Linked

In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

13 Answers13

Source unoptimized.cpp

Source optimized.cpp

Compilation

Assembly (unoptimized.s)

Assembly (optimized.s)

diff

Linked

Source `unoptimized.cpp`

Source `optimized.cpp`