When cross-compiling with clang and the `-target` option, targeting the same architecture and hardware as the native system, I've noticed that clang seems to generate worse code than the native-built counterpart in cases where the `<sys>` component of the triple is `none`.
Consider this simple code example:
```c
int square(int num) {
    return num * num;
}
```
When optimized at `-O3` with `-target x86_64-linux-elf`, the native x86_64 target code generation yields:
```
square(int):
        mov     eax, edi
        imul    eax, edi
        ret
```
The code generated with `-target x86_64-none-elf` yields:
```
square(int):
        push    rbp
        mov     rbp, rsp
        mov     eax, edi
        imul    eax, edi
        pop     rbp
        ret
```
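For reference, both listings above come from invocations roughly like the following (file names are just placeholders):

```sh
# Hosted triple: produces the minimal three-instruction version.
clang -O3 -S -masm=intel -target x86_64-linux-elf square.c -o linux.s

# Bare-metal triple: produces the version with the extra prologue/epilogue.
clang -O3 -S -masm=intel -target x86_64-none-elf square.c -o none.s
```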
Despite the same hardware and optimization flags, clearly something is missing an optimization: the `none` build keeps the `push rbp` / `mov rbp, rsp` / `pop rbp` frame-pointer setup and teardown that the `linux` build eliminates. The problem goes away if `none` is replaced with `linux` in the target triple, despite no system-specific features being used.
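Since the extra instructions look like frame-pointer maintenance, I'd guess forcing it off would hide this particular symptom, though that's an untested assumption on my part and wouldn't explain the other cases listed below:

```sh
# Untested guess: if the bare-metal triple merely defaults to keeping
# the frame pointer, this should restore the minimal codegen.
clang -O3 -fomit-frame-pointer -target x86_64-none-elf -S square.c
```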
At first glance it may look like it simply isn't optimizing at all, but other code segments show that it is performing some optimizations, just not all. Loop unrolling, for example, still occurs.
Although the above examples are simply with x86_64, in practice this issue is generating code bloat for an `armv7`-based, resource-constrained embedded system, and I've noticed several missed optimizations in certain circumstances, such as:

- Not removing unnecessary setup/cleanup instructions (same as in x86_64)
- Not coalescing certain sequential inlined increments into a single `add` instruction (at `-Os`, when inlining `vector`-like `push_back` calls; this optimizes when built natively from an arm-based system running armv7)
- Not coalescing adjacent small integer values into a single `mov` (such as merging a 32-bit `int` with a `bool` in an `optional` implementation, sketched after this list; this also optimizes when built natively from an arm-based system running armv7)
- etc.
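To make that last coalescing case concrete, here is a reduced, hypothetical sketch of the `optional`-like layout I mean (the names are mine, not the real code):

```cpp
#include <cstdint>

// Hypothetical reduction of the optional-like type: a 32-bit payload
// followed immediately by an engagement flag.
struct MaybeInt32 {
    int32_t value;
    bool engaged;
};

MaybeInt32 make(int32_t v) {
    // The native armv7 build initializes both members with what amounts
    // to a single store; the <sys> = none build emits separate stores
    // for `value` and `engaged`.
    return MaybeInt32{v, true};
}
```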
I would like to know what I can do, if anything, to achieve the same optimizations when cross-compiling as when compiling natively. Are there any relevant flags that can help tweak tuning and that are somehow implied by the `<sys>` component of the triple?
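One comparison that seems worth making (assuming any `<sys>`-implied tuning shows up at the driver level) is diffing the cc1 command lines that the driver constructs for each triple, which `-###` prints to stderr:

```sh
# Dump (without executing) the cc1 command line for each triple, then
# diff to see what the <sys> component of the triple actually changes.
clang -### -O3 -c square.c -target x86_64-linux-elf 2> linux-cc1.txt
clang -### -O3 -c square.c -target x86_64-none-elf  2> none-cc1.txt
diff linux-cc1.txt none-cc1.txt
```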
If possible, I'd also love some insight as to why the cross-compilation target appears to fail to optimize certain things simply because the system component is different, despite having the same architecture and ABI. My understanding is that LLVM's optimizer operates on the IR, which should result in effectively the same optimizations so long as nothing relies on the target system itself.