
Given the following C program:

#include <stdio.h>

int main() {
    int a = 3;
    printf("hello world %d\n", a);
}

Compiling it with x86-64 clang 6.0 and no optimisations produces the following assembly:

main: # @main
  pushq %rbp
  movq %rsp, %rbp
  subq $16, %rsp
  movabsq $.L.str, %rdi
  movl $3, -4(%rbp)
  movl -4(%rbp), %esi
  movb $0, %al
  callq printf
  xorl %esi, %esi
  movl %eax, -8(%rbp) # 4-byte Spill
  movl %esi, %eax
  addq $16, %rsp
  popq %rbp
  retq
.L.str:
  .asciz "hello world %d\n"

I noticed that with this C program:

#include <stdio.h>

int main() {
    double a = 3;
    printf("hello world %f\n", a);
}

the compiler produces similar assembly:

.LCPI0_0:
  .quad 4613937818241073152 # double 3
main: # @main
  pushq %rbp
  movq %rsp, %rbp
  subq $16, %rsp
  movabsq $.L.str, %rdi
  movsd .LCPI0_0(%rip), %xmm0 # xmm0 = mem[0],zero
  movsd %xmm0, -8(%rbp)
  movsd -8(%rbp), %xmm0 # xmm0 = mem[0],zero
  movb $1, %al    # <--- this
  callq printf
  xorl %ecx, %ecx # <--- this
  movl %eax, -12(%rbp) # 4-byte Spill
  movl %ecx, %eax
  addq $16, %rsp
  popq %rbp
  retq
.L.str:
  .asciz "hello world %f\n"
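
A side note on the `.quad` constant in the listing above: it is the raw bit pattern of the double 3.0. The following is a minimal sketch for checking that, assuming `double` is an IEEE 754 binary64 value (true on x86-64); the helper name `double_bits` is hypothetical and not part of the original program.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 64-bit pattern of a double.
 * Assumes double is IEEE 754 binary64, as on x86-64. */
uint64_t double_bits(double d) {
    uint64_t bits;
    /* memcpy is the well-defined way to reinterpret the bytes in C. */
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Under that assumption, `double_bits(3.0)` is 4613937818241073152, i.e. 0x4008000000000000: sign 0, biased exponent 0x400, and a fraction field with only its top bit set (1.5 × 2^1).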

However, there are three differences:

  • the xmm0 SSE register is used; I understand this has to do with floating point
  • ECX is XORed rather than ESI after the call
  • AL is set to 1 rather than 0

What do these differences mean?

flooblebit
  • 1) Yes 2) Doesn't matter. Neither of them is actually needed (it's a temporary for the implicit `return 0`) but you didn't turn optimization on so you are looking at garbage code 3) As per the ABI, you need to set `AL` to the number of SSE registers used. That is one, here. – Jester Apr 08 '18 at 17:28
  • @Jester Ok this makes sense thank you! – flooblebit Apr 08 '18 at 17:42
  • Compile with `-O3` to optimize away the silly parts of this code and leave the interesting parts. I'm surprised at `movabsq $.L.str, %rdi`, though; IDK why clang/LLVM would ever emit that instead of a RIP-relative LEA. Maybe clang considers using `mov $.L.str, %edi` (5 bytes), but with optimization disabled it doesn't end up optimizing to a RIP-relative LEA or a zero-extended mov-immediate. – Peter Cordes Apr 08 '18 at 18:20
  • @PeterCordes Ah I'm writing a compiler so I like to see the code for simple things to compare but when I turn on optimisations most of the time it's so optimised there isn't really anything to it, it's as if clang runs my simple programs and inlines all of the results... that is a question I had though, why does it use movabsq and not lea... is a lea faster? – flooblebit Apr 08 '18 at 18:28
  • A 7-byte LEA decodes faster, and fetches faster from the uop cache in Sandybridge-family. 10-byte `movabsq` with a 64-bit immediate has to be handled specially in the uop cache, which can slow things down. http://agner.org/optimize/. The best option if making position-dependent code for the default code-model on Linux is 5-byte `mov r32,imm32` (to implicitly zero-extend to 64-bits), because all static symbols are in the low 2GiB. On OS X, the image base address is above 4G, so RIP-relative LEA is the best option even if you don't care about making it position-independent. – Peter Cordes Apr 08 '18 at 18:32
  • Of course `-O3` inlines the answer when you give it compile-time-constant inputs. That makes the code run faster and be smaller. If you want to see code for runtime-variable inputs, write functions like `int foo(int x, int y) { return x + y; }`. You don't need or want a `main` function, because you don't want to run it, just look at asm for a function. See also https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output. Don't try to be like `clang -O0`, spilling to memory after every statement is for debugging, and is horrible. – Peter Cordes Apr 08 '18 at 18:34
  • @PeterCordes Amazing, this is really interesting. Thank you. I don't want to deviate too far from my question as per the SO rules... but any resources where I can look into further detail w/r/t these kinds of things? – flooblebit Apr 08 '18 at 18:35
  • Yes, https://stackoverflow.com/tags/x86/info has a performance section with more links to stuff about making asm that doesn't suck, and about how various x86 microarchitectures work internally and how to optimize for them. – Peter Cordes Apr 08 '18 at 18:35
  • @PeterCordes You've been a great help thank you! – flooblebit Apr 08 '18 at 18:40
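
Following the advice in the comments, here is a minimal sketch of a file meant for reading the generated assembly rather than running it. `foo` is taken from the comment above; the names `print_int` and `print_double` are hypothetical. Because every input arrives as a function argument, an optimising build cannot fold the results at compile time, and no `main` is needed.

```c
#include <stdio.h>

/* From the comment above: runtime-variable inputs, nothing to constant-fold. */
int foo(int x, int y) { return x + y; }

/* Variadic call with no floating-point argument.
 * Returns printf's result (the number of characters written). */
int print_int(int a) { return printf("hello world %d\n", a); }

/* Variadic call with one double argument, which is passed in xmm0. */
int print_double(double a) { return printf("hello world %f\n", a); }
```

Compiling this with something like `clang -O3 -S` should drop the `-O0` spills while, under the x86-64 System V ABI, the two `printf` wrappers would still be expected to differ in the value placed in `AL` before the call (0 versus 1), which is the difference asked about in the question.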

0 Answers