Hard as it may be to believe, the construct p[u+1] occurs in several places in the innermost loops of code I maintain, such that getting the micro-optimization of it right makes hours of difference in an operation that runs for days.
Typically *((p+u)+1) is most efficient. Sometimes *(p+(u+1)) is most efficient. Rarely, *((p+1)+u) is best. (Usually an optimizer can convert *((p+1)+u) to *((p+u)+1) when the latter is better, but it can't convert *(p+(u+1)) to either of the others.)
p is a pointer and u is an unsigned int. In the actual code, at least one of them (more likely both) will already be in a register at the point the expression is evaluated. Those facts are critical to the point of my question.
In 32-bit builds (before my project dropped support for them), all three have exactly the same semantics, and any half-decent compiler simply picks the best of the three; the programmer never needs to care.
In these 64-bit uses, the programmer knows all three have the same semantics, but the compiler doesn't. So far as the compiler knows, the decision of when to widen u from 32 bits to 64 bits can affect the result.
What is the cleanest way to tell the compiler that the semantics of all three are the same and the compiler should select the fastest of them?
In one Linux 64-bit compiler, I got nearly there with p[u+1L], which causes the compiler to select intelligently between the usually best *((p+u)+1) and the sometimes better *(p+((long)(u)+1)). In the rare case where *(p+(u+1)) was still better than the second of those, a little is lost.
Obviously, that does no good in 64-bit Windows, where long is still 32 bits. Now that we have dropped 32-bit support, maybe p[u+1LL] is portable enough and good enough. But can I do better?
Note that using std::size_t instead of unsigned for u would eliminate this entire problem, but it would create a larger performance problem nearby. Casting u to std::size_t right there is almost good enough, and maybe the best I can do. But that is pretty verbose for an imperfect solution.
Simply coding (p+1)[u] makes a selection more likely to be optimal than p[u+1] does. If the code were less templated and more stable, I could change them all to (p+1)[u], profile, and then switch a few back to p[u+1]. But the templating tends to defeat that approach (a single source line appears in many places in the profile, adding up to serious time, though no individual instance takes serious time).
The compilers this needs to be efficient on are GCC, ICC, and MSVC.