
Hard as it may be to believe, the construct `p[u+1]` occurs in several places in the innermost loops of code I maintain, such that getting the micro-optimization of it right makes hours of difference in an operation that runs for days.

Typically `*((p+u)+1)` is most efficient. Sometimes `*(p+(u+1))` is most efficient. Rarely, `*((p+1)+u)` is best. (But usually an optimizer can convert `*((p+1)+u)` to `*((p+u)+1)` when the latter is better, while it can't interchange `*(p+(u+1))` with either of the others.)
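
For concreteness, the three spellings can be written out as tiny functions (a sketch only; the element type `int` and the function names are placeholders, since the real code is templated):

```cpp
// The three algebraically equal spellings from the question.
// All dereference the element one past index u.
inline int form_a(int *p, unsigned u) { return *((p + u) + 1); } // usually best
inline int form_b(int *p, unsigned u) { return *(p + (u + 1)); } // the +1 happens in 32-bit unsigned math
inline int form_c(int *p, unsigned u) { return *((p + 1) + u); } // rarely best
```

For any in-bounds `u` all three load the same element; the question is which instruction sequence each spelling permits.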

`p` is a pointer and `u` is an `unsigned`. In the actual code, at least one of them (more likely both) will already be in a register at the point the expression is evaluated. Those facts are critical to the point of my question.

In 32-bit builds (before my project dropped support for them), all three have exactly the same semantics, any half-decent compiler simply picks the best of the three, and the programmer never needs to care.

In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't. As far as the compiler knows, the decision of when to extend `u` from 32 bits to 64 bits can affect the result.

What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?

On one Linux 64-bit compiler, I got nearly there with `p[u+1L]`, which causes the compiler to select intelligently between the usually-best `*((p+u)+1)` and the sometimes-better `*(p+((long)(u)+1))`. In the rare case where `*(p+(u+1))` was still better than the second of those, a little is lost.

Obviously, that does no good in 64-bit Windows, where `long` is still 32 bits. Now that we have dropped 32-bit support, maybe `p[u+1LL]` is portable enough and good enough. But can I do better?

Note that using `std::size_t` instead of `unsigned` for `u` would eliminate this entire problem, but would create a larger performance problem nearby. Casting `u` to `std::size_t` right there is almost good enough, and maybe the best I can do. But it is pretty verbose for an imperfect solution.
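
Spelled out, that cast-in-place workaround looks something like this (a sketch; `load_next` and the element type are placeholders):

```cpp
#include <cstddef>

// Widen u explicitly before the +1, so no 32-bit wraparound semantics
// constrain the optimizer and it is free to fold the +1 into the
// addressing mode. Verbose, but portable to both LP64 and LLP64.
inline int load_next(int *p, unsigned u) {
    return p[std::size_t(u) + 1];
}
```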

Simply coding `(p+1)[u]` makes a selection more likely to be optimal than `p[u+1]`. If the code were less templated and more stable, I could change them all to `(p+1)[u]`, profile, then switch a few back to `p[u+1]`. But the templating tends to destroy that approach: a single source line appears in many places in the profile, adding up to serious time, but not individually serious time.

The compilers this needs to be efficient on are GCC, ICC and MSVC.

stgatilov
JSF
  • `Typically *((p+u)+1) is most efficient.` Have you measured it? Seems like bad profiling. I don't expect any performance difference. – David Haim Dec 29 '15 at 13:46
  • I have just switched on my linux machine, I am checking the same @JSF. Will see how it goes. – SACHIN GOYAL Dec 29 '15 at 13:50
  • @DavidHaim, yes I have measured across the original change. If you know x86-64 assembler, you should know why `*((p+u)+1)` is typically most efficient. – JSF Dec 29 '15 at 13:56
  • [Comparison on godbolt](https://goo.gl/8iIiPj). This suggests that anything is better than `p[u + 1]` :-( I'd go with `*(p + u + 1)` in that case. – Kerrek SB Dec 29 '15 at 14:00
  • @SACHINGOYAL, In most people's code this effect, even when it exists, will be tiny enough that it is hard to measure. I can't give you the code in which it matters. But factors required for this to matter include: The location being accessed must be in cache. `p` must be stable in a register over significant surrounding code. `u` must be updated in a register locally (but trivially such that the computation time doesn't mask this performance difference). `u` must still be needed (such as for the next update to `u`) after this expression. – JSF Dec 29 '15 at 14:01
  • @Kerrek, The whole discussion is about `unsigned u`. Using signed (as you did) is a different (and worse) cluster of performance problems. – JSF Dec 29 '15 at 14:03
  • @KerrekSB, JSF: so according to the output, the "clean, signed version" of `x[y]` only adds one assembly instruction over the rest. That also means the expression has to be executed milliards of times in order to show any real difference. – David Haim Dec 29 '15 at 14:03
  • @DavidHaim, if there is a `movsx` in the generated code, the performance is unacceptable. The expression is certainly executed millions of millions of times. – JSF Dec 29 '15 at 14:06
  • @JSF: Ah, I missed that detail. Maybe you can make that clearer in your question (not just the title!)? – Kerrek SB Dec 29 '15 at 14:07
  • @KerrekSB, also in real code the optimizer knows the 64 bit register holding the 32-bit unsigned has had its top half cleared for free by the previous instruction. In an artificial test the optimizer won't know that. – JSF Dec 29 '15 at 14:09
  • I haven't tried (and don't need) the bizarre example `u ^= p[u+1]` but my intuition is that something like that (repeated) would be a good example of the case that `u ^= (p+1)[u]` is faster (even if you stop the optimizer from promoting `(p+1)` out of the loop). – JSF Dec 29 '15 at 14:22
  • I can believe that this micro-optimization can save hours (1 or 2) in a run that takes days (10 or 20), in other words ½%, but is it the best use of developer time? A change that improved cache locality, for example, would give a much more significant improvement. – Martin Bonner supports Monica Dec 29 '15 at 15:21
  • Can this question be answered? The code in question is proprietary and can't be shared (and may be too large for a sane question) but has many artificial constraints that seriously affect appropriate answers to the question. I don't think there is a way to teach the optimizer, and I'm not sure any optimal answer can be given without access to the code. – Cornstalks Dec 29 '15 at 15:52
  • In some important examples, this was the difference between starting Friday afternoon and finishing Monday morning vs starting Friday afternoon and finishing Monday afternoon. And **that** was the difference between a major customer being happy vs. unhappy. – JSF Dec 29 '15 at 15:53
  • @Cornstalks, I don't really know if it can be answered. But the proprietary code should not be a significant factor. If there is a good syntax to let the compiler choose the more efficient of `(p+1)[u]` or `p[u+1]` that should be independent of why doing so matters a lot. It should be answerable anywhere that it matters even a little. – JSF Dec 29 '15 at 15:56
  • Could you use something like [`__assume` or `__builtin_unreachable`](http://stackoverflow.com/q/23709259/1287251) to tell the compiler `&(p+1)[u] == &p[u+1]`? I'm not sure where you'd have to put that or how many times you'd have to repeat it. – Cornstalks Dec 29 '15 at 16:00
  • @Cornstalks I guess I should experiment. My expectation is that optimizers are not smart enough to use this information given that way. But maybe I'm too pessimistic. – JSF Dec 29 '15 at 16:02
  • Then perhaps tell the compiler that `u < UINT_MAX`? I assume that's the reason the optimizer does not assume the two statements are the same. – Cornstalks Dec 29 '15 at 16:05
  • Telling the compiler `u < N`, where `N` was a compile time constant and the compiler knew `N*sizeof(*p)` … – JSF Dec 29 '15 at 16:15
  • Oh god, is this really about unsigned overflow? *censored* C/C++. – Karoly Horvath Dec 29 '15 at 16:33
  • @KarolyHorvath, yes I thought that was clear. It is **really** about telling the compiler that the standard mod `2**N` behavior of `unsigned` can be ignored **HERE** (even though it is depended on elsewhere). (If there were a more general way to tell a compiler "do this or that, whichever is faster", I would be far happier with such an answer. But for the question I asked, just making the optimizer **understand** `u+1>u` would do it.) – JSF Dec 29 '15 at 16:52
  • If this is so localized and crucial, I wouldn't waste so much time in trying to figure out the exact magic to make the compiler emit the assembly you want (with the risk of having everything break down at the next compiler/library update). If it's viable, I would take the generated assembly, fix it with the fastest sequence found and keep it as inline assembly. – Matteo Italia Dec 29 '15 at 17:26
  • @MatteoItalia It is only localized to a few places in heavily templated tiny functions that are inlined to thousands of places. Inline assembly would break the templating and making the decision by source location would make it wrong for many inline locations. – JSF Dec 29 '15 at 17:36
  • I would bet that simply rewriting some of the code could produce benefits that far outweigh this concern. As @MartinBonner mentioned, considering cache locality would be one thing to look at. Manually unrolling some of those loops may also make a significant difference, and delving into assembly as Matteo Italia suggested would be another. – Nick Dec 29 '15 at 17:51
  • I have to agree with Cornstalks, unless you can reproduce the *essence* of the problem in a testable code, there's really nothing we can do. – Karoly Horvath Dec 29 '15 at 21:48
  • @Karoly: I think we can look at Kerrek's godbolt post and accept anything that produces a single-instruction load (after a `mov edi,edi` to zero the upper32). We know the upper32-zeroing will go away when inlining. **JSF:** `1ULL` [looks optimal for 64bit, and for 32bit with clang](https://goo.gl/3qaa8H), but not gcc except in one case, (but you don't care about 32b anyway). – Peter Cordes Dec 30 '15 at 05:47
  • First of all, I think `sizeof(*p)` is very important to understand what is going on. Also, the timing differences may come from unrolling or vectorization, which makes it painful to understand what is going on. So the OP should probably generate assembly output and analyse the differences. Unfortunately, optimizing small benchmark code is pretty useless here. – stgatilov Dec 30 '15 at 06:22
  • What is the type of `p` in your code? The generated assembly differs based on the size of type being pointed to, of course, so to give the best targeted answer we need to know this. – M.M Dec 30 '15 at 06:47
  • Interesting to [compare gcc with clang](http://goo.gl/LxM6Tr) ... clang seems to have hit on a simpler option for the original code (`f1` in my example) , and both compilers miss an obvious opportunity to optimize `f2` – M.M Dec 30 '15 at 06:57

1 Answer


The answer is inevitably compiler- and target-specific, but even if `1ULL` is wider than a pointer on whatever target architecture, a good compiler should optimize it away. "Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?" explains why a wider computation truncated to pointer width gives results identical to doing the computation at pointer width in the first place. This is why compilers can optimize it away even on 32bit machines (or x86-64 with the x32 ABI) when `1ULL` leads to promotion of the `+` operands to a 64bit type. (Or on some 64bit ABI for some architecture where `long long` is 128b.)
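
That truncation property is easy to check directly (a small demonstration added here, not part of the original answer):

```cpp
#include <cstdint>

// Addition done in 64 bits and truncated back to 32 bits matches a
// 32-bit addition for all inputs; this is why the promotion forced by
// 1ULL costs nothing once the compiler sees only the low bits are kept.
std::uint32_t add_narrow(std::uint32_t a, std::uint32_t b) {
    return a + b;  // wraps mod 2^32
}
std::uint32_t add_wide_then_truncate(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(std::uint64_t(a) + std::uint64_t(b));
}
```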


`1ULL` looks optimal for 64bit, and for 32bit with clang. You don't care about 32bit anyway, but gcc wastes an instruction in `return p[u + 1ULL];`. All the other cases compile to a single load with a scaled-index+4+p addressing mode. So other than one compiler's optimization failure, `1ULL` looks fine for 32bit as well. (I think it's unlikely that it's a clang bug and that the optimization is illegal.)

    int v1ULL(std::uint32_t u) { return p[u + 1ULL]; }
    //   ...  load u from the stack
    //    add     eax, 1
    //    mov     eax, DWORD PTR p[0+eax*4]

instead of

    mov     eax, DWORD PTR p[4+eax*4]

Interestingly, gcc 5.3 doesn't make this mistake when targeting the x32 ABI (long mode with 32bit pointers and a register-call ABI similar to SysV AMD64). It uses a 32bit address-size prefix to avoid using the upper 32b of edi.

Annoyingly, it still uses an address-size prefix when it could save a byte of machine code by using a 64bit effective address (when there's no chance of overflow/carry into the upper32 generating an address outside the low 4GiB). Passing the pointer by reference is a good example:

    int x2(char *&c) { return *c; }
    //    mov     eax, DWORD PTR [edi]  ; upper32 of rax is zero
    //    movsx   eax, BYTE PTR [eax]   ; could be byte [rax], saving one byte of machine code

Err, actually I forget. 32bit addresses might sign-extend to 64b, not zero-extend. If that's the case, it could have used movsx for the first instruction, too, but that would have cost a byte because movsx has a longer opcode than mov.

Anyway, x32 is still an interesting choice for pointer-heavy code that wants more registers and a nicer ABI, without the cache-miss hit of 8B pointers.


The 64bit asm has to zero the upper32 of the register holding the parameter (with mov edi,edi), but that goes away when inlining. Looking at godbolt output for tiny functions is a valid way to test this.

If we want to make doubly sure that the compiler isn't shooting itself in the foot and zeroing the upper32 when it should know it's already zero, we could make test functions with an arg passed by reference.

    int v1ULL(const std::uint32_t &u) { return p[u + 1ULL]; }
    //  mov     eax, DWORD PTR [rdi]
    //  mov     eax, DWORD PTR p[4+rax*4]
Peter Cordes
  • Why can't you use `size_t(1)`? This does not rely on optimizer and is OK for both 32-bit and 64-bit. – stgatilov Dec 30 '15 at 06:37
  • @stgatilov OP said that he found `size_t(1)` unaesthetic ... but it's probably the best option. A potential drawback of `1LL` is that it doesn't work prior to C++11 (or at least, requires non-standard extensions), and OP seemed to be implying he was working with some legacy code. – M.M Dec 30 '15 at 06:41
  • @M.M: JSF wrote that using `size_t` for `u` is not viable, because it creates more performance problems elsewhere. Anyway, if `size_t(1)` is not pretty enough, write `#define I ((std::size_t)1)` and be happy. – stgatilov Dec 30 '15 at 06:45
  • `I` is a reserved identifier (a complex number) .. but other than that, good idea :) – M.M Dec 30 '15 at 07:00