The function as written is neither optimal nor portable. That said, the authors of C89 included some badly-written rules which, if interpreted strictly, make C89 a much less powerful language than the earlier C dialects that existed on many platforms. The stated purpose of the rule is to keep a compiler which is given code like:
float *fp;
int i;

int foo(void)
{
    i++;
    *fp = 1.0f;
    return i;
}
from having to pessimistically assume that the write to *fp might affect i, despite the complete absence of anything to suggest that an object of type int might be affected. Code which used type punning for a variety of purposes, including chunking optimizations, was widespread when C89 was written, but most such code gave the compiler clear indications that aliasing was going to occur. Generally, if an object was going to be modified through a foreign-type pointer between two "normal" accesses, one or both of the following would occur between those accesses:
The object's address would be taken.
A pointer would be converted from the object's type to another type.
Aside from the obvious case where a pointer is used to access an object of
its precise type, the Standard mostly identifies cases where a compiler
would have to allow for aliasing even though it would not otherwise be
obvious that it was possible (e.g. between an "int" and a pointer of type
"unsigned*", or between anything and a pointer of type "char*"). Given the
rationale, I think the authors intended to focus on making sure that
compiler writers handled the cases where there would be no reason to expect
aliasing, but didn't think compiler writers needed to be told how to
identify cases where aliasing was obviously likely.
Chunking optimizations will be safe on compilers which recognize that the
address-of and casting operators imply that cross-type aliasing is likely,
provided that all use of the pointer resulting from a cast is performed
before the next access using the non-cast pointer--a requirement that most
chunking code would generally meet. Unfortunately, there is no standard for
"sane-compiler C", and gcc treats the fact that the Standard's authors
didn't require compilers to handle the obvious-aliasing cases as
justification for ignoring them.
Still, the performance advantages from chunking optimizations may outweigh
the performance costs of -fno-strict-aliasing, especially if code uses
restrict qualifiers when appropriate. There are some situations, mainly
involving global variables, where restrict isn't sufficient to enable
useful optimizations; those could be handled by an aliasing mode which
limits analysis to objects of static or automatic duration (like the
objects in the rationale's example), but gcc offers no such mode.
BTW, I'm not sure what instruction timings are like on modern x86 processors, but on some ARM variants compilers would have a chance at producing optimal long-string code from something like:
uint32_t x01000000 = 0x01000000;

uint64_t *search(uint64_t *p)
{
    uint64_t n;
    uint32_t a, b;
    uint32_t v = x01000000; // See note below
    do
    {
        n = *p++;
        a = (uint32_t)n;
        b = n >> 32;
        if (a >= v || (a << 8) >= v || (a << 16) >= v || (a << 24) >= v ||
            b >= v || (b << 8) >= v || (b << 16) >= v || (b << 24) >= v)
            return p;
    } while(1);
}
Figuring out which comparison was responsible for breaking out of the loop would take extra time, but keeping such considerations out of the loop may allow the loop itself to be more efficient.
Many ARM processors have an instruction which can compare a shifted value with a register; compilers sometimes need a little help to realize that 0x01000000 should be kept in a register (a compare-with-constant instruction exists, but
doesn't include a "free" shift of the register being compared), but with help they can find the compare-with-shift. I haven't yet found a way to convince a compiler to generate optimal code for the ARM7-TDMI, which would be equivalent to:
search:
    mov   r1,#0x01000000
lp:
    ldmia r0!,{r2,r3}
    cmp   r1,r2
    cmphi r1,r2,asl #8
    cmphi r1,r2,asl #16
    cmphi r1,r2,asl #24
    cmphi r1,r3
    cmphi r1,r3,asl #8
    cmphi r1,r3,asl #16
    cmphi r1,r3,asl #24
    bhi   lp
    bx    lr
That would take 15 cycles per eight bytes; it could be adapted to take 25 cycles per sixteen bytes. A loop that processes eight bytes individually would take 42 cycles; unrolled to sixteen bytes would be 82 cycles. The best loop I've seen compilers generate for the uint64_t based code would be 22 cycles for eight bytes--almost half again as long as the optimal code, but still about twice as fast as a version using bytes.