5

I've made a function to calculate the length of a C string (I'm trying to beat clang's optimizer using -O3). I'm running macOS.

_string_length1:
    push rbp
    mov rbp, rsp
    xor rax, rax
.body:
    cmp byte [rdi], 0
    je .exit
    inc rdi
    inc rax
    jmp .body
.exit:
    pop rbp
    ret

This is the C function I'm trying to beat:

size_t string_length2(const char *str) {
  size_t ret = 0;
  while (str[ret]) {
    ret++;
  }
  return ret;
}

And it disassembles to this:

string_length2:
    push rbp
    mov rbp, rsp
    mov rax, -1
LBB0_1:
    cmp byte ptr [rdi + rax + 1], 0
    lea rax, [rax + 1]
    jne LBB0_1
    pop rbp
    ret

Every C function sets up a stack frame using push rbp and mov rbp, rsp, and breaks it using pop rbp. But I'm not using the stack in any way here, I'm only using processor registers. It worked without using a stack frame (when I tested on x86-64), but is it necessary?

SpilledMango
  • 575
  • 7
  • 25
  • 4
    A related question: [here](https://stackoverflow.com/q/14666665/509868) – anatolyg Jul 24 '17 at 15:32
  • 7
    Whether or not the frame pointer is required is dictated by the platform ABI. In general, no, it is not required, and so optimizing compilers eliminate this code. That includes Clang, so it's somewhat surprising to see that your disassembly includes it. If I remember correctly, Michael Petch investigated this a while back and determined that, on macOS, the frame pointer is typically used, so Clang does *not* eliminate it like it normally would on other platforms. I can't remember if it is *required* by the macOS ABI, or just customary. Are you, in fact, on macOS? – Cody Gray - on strike Jul 24 '17 at 15:37
  • Also note that compiler may inline its own `strlen` (or even optimize to a constant) while this can not be inlined. I would expect the two `inc` instructions in the loop to be also quite bad. Furthermore, depending on expected string length, there are other optimizations to be had :) – Jester Jul 24 '17 at 15:38
  • Could you post the C source used to produce the assembly code you want to compete against? Then the question would make even more sense. – Jabberwocky Jul 24 '17 at 15:46
  • If you inline the function, you effectively achieve the same result in C/C++. – David Hoelzer Jul 24 '17 at 15:52
  • 3
    @Michael Be careful making the question into something it is not. The question isn't, "how can I optimize this code further?" It is, as the title says, is the stack frame required on x86-64? Both are valid questions, of course, but only one question should be asked/answered at a time. So we don't really need the source code to determine whether a stack frame is required. – Cody Gray - on strike Jul 24 '17 at 15:54
  • Maybe related: https://stackoverflow.com/questions/14666665/trying-to-understand-gcc-option-fomit-frame-pointer – Support Ukraine Jul 24 '17 at 15:56
  • @CodyGray it's just for completeness. – Jabberwocky Jul 24 '17 at 15:56
  • By the way, without setting up the stack frame, my Assembly function is faster than the C function. – SpilledMango Jul 24 '17 at 16:02
  • @SpilledMango it should, you eliminate memory write/read. You might even get better if you do jumps instead of call/ret. – Serge Jul 24 '17 at 16:06
  • 1
    @Serge But then I have to know which function to jump back to. The return address could be stored in a processor register though, instead of the stack. But C always uses the `call` instruction when calling functions. – SpilledMango Jul 24 '17 at 16:09
  • No, in your example assembly code the stackframe is not necessary. Any stackwalking tools might get confused though, but those are not part of your [release] code. – Paul Ogilvie Jul 24 '17 at 16:49
  • @SpilledMango you cannot do it in 'c' directly. But if you are after this kind of optimizations, you can insert assembly code int he 'c' program. The other solution would be to always inline this function (you would probably need a c++ compilation there) or implement it as a macro. – Serge Jul 24 '17 at 16:51
  • FYI, this is surely an ABI or default settings thing, as on my Linux machine clang 3.8 at -O3 doesn't touch the stack when compiling that function (the assembly is identical, besides the rows that cite rbp, which are missing). – Matteo Italia Jul 24 '17 at 17:55
  • Off-topic: It should be easy to beat clang's loop if you check multiple bytes at a time. Use integer bit-hacks, or much much better use SSE2 `pcmpeqb` / `pmovmskb` / `test`/`jnz`, or an unrolled version of that. (Then `bsf` that mask to find where the first zero was). See https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64 https://stackoverflow.com/questions/25566302/vectorized-strlen-getting-away-with-reading-unallocated-memory. Look at the actual asm strlen implementations in OS X's libc, and especially glibc. – Peter Cordes Jul 27 '17 at 02:23
  • glibc has a neat strategy for long strings: use `pminub` to combine a whole cache-line of vectors and find out if there were any zeros, then go back and find out where the first one was. – Peter Cordes Jul 27 '17 at 02:24

2 Answers2

5

No, the stack frame is, at least in theory, not always required. An optimizing compiler might in some cases avoid using the call stack. Notably when it is able to inline a called function (in some specific call site), or when the compiler successfully detects a tail call (which reuses the caller's frame).

Read the ABI of your platform to understand requirements related to the stack.

You might try to compile your program with link time optimization (e.g. compile and link with gcc -flto -O2) to get more optimizations.

In principle, one could imagine a compiler clever enough to (for some programs) avoid using any call stack.


BTW, I just compiled a naive recursive long fact(int n) factorial function with GCC 7.1 (on Debian/Sid/x86-64) at -O3 (i.e. gcc -fverbose-asm -S -O3 fact.c). The resulting assembler code fact.s contains no call machine instruction.

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
1

Every C function sets up a stack frame using...

This is true for your compiler, not in general. It is possible to compile a C program without using the stack at all—see, for example, the method CPS, continuation passing style. Probably no C compiler on the market does so, but it is important to know that there are other ways to execute programs, in addition to stack-evaluation.

The ISO 9899 standard says nothing about the stack. It leaves compiler implementations free to choose whichever method of evaluation they consider to be the best.

Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
alinsoar
  • 15,386
  • 4
  • 57
  • 74