
I have made this code:

global  strlen
    ; int strlen(const char *string);
strlen:
    xor     rcx, rcx

retry:
    cmp byte    [rdi + rcx], 0
    je      result
    inc     rcx
    jmp     retry

result:
    mov     rax, rcx
    ret

And this is how I test it:

#include <stdio.h>
#include <string.h>  /* prototype for strlen */

int main(int argc, char **argv)
{
    char *bob = argv[1];
    printf("%zu\n", strlen(bob));
    return 0;
}

This is a working strlen, no problem there. But I've noticed that I can swap the `rdi` in the first line of the `retry` block for `rax` without anything changing, and I don't know if this is normal behavior. Which of those registers should I keep?

asked by Comte_Zero
  • Looks like the AMD64 ABI (e.g. Linux). That RAX and RDI have the same content on entry is pure coincidence (e.g. you're passing the return value of a previously called function as the first parameter of another), and AFAIK is guaranteed nowhere – Tommylee2k Feb 25 '19 at 13:59
  • Do you compile with `gcc` and no optimizations? At `-O0` it will use `rax` to move the pointer value into memory and then into `rdi` as the argument for the function call, so the two accidentally contain identical values. Try `-O3` to get optimized machine code, which will load `rdi` directly (and `rax` will contain whatever the CRT library initialization left there, i.e. highly likely something else). (Generally, in machine code produced by a compiler, either something is as defined in the specification/ABI, or it is an accident, and you should never rely on a particular "found out" feature; it may break in the next build.) – Ped7g Feb 25 '19 at 14:28
  • @Ped7g I'm not using any optimization flags; the purpose of this question is to understand how this happens, because I'm learning assembly. I believe you answered the question, and I should keep `rdi` in my code – Comte_Zero Feb 25 '19 at 14:33
  • Not related to the main question, but why not use `xor al,al; repne scasb` here to find the 0 byte (end of string)? – RbMm Feb 25 '19 at 14:34
  • Yes, I believe this is a "duplicate" of "what is the ABI for my target platform", which you didn't specify (and I can't find a meta-answer about 64-bit ABIs right now, although I'm pretty sure there is at least one). I.e. you should read the ABI specs (how functions are called) and hold to that; any accidental values elsewhere, or getting away with modifying a register that should have been preserved while "nothing happens", is just a temporary accident, which may change at any time in a future build (newer compiler version, other options, or a slight change in source = different machine code) – Ped7g Feb 25 '19 at 14:35
  • You may for example check this lengthy answer to get better idea how things may accidentally work to quite some extent (confusing many beginners as for example their code works on their linux box, but does crash in Linux subsystem of Windows 10 (for example)). But in the end the proper solution is to do it properly, as all those "works" is fragile abuse of current implementation/situation, not permanent solution: https://stackoverflow.com/questions/46087730/what-happens-if-you-use-the-32-bit-int-0x80-linux-abi-in-64-bit-code (but if you are wondering where values belong, search for 64b ABI) – Ped7g Feb 25 '19 at 14:44
  • @RbMm I don't know if this would be faster in a meaningful way; if not, I find the above code more readable – Comte_Zero Feb 25 '19 at 15:06
  • @Comte_Zero: repne scasb is not fast; it still only checks 1 byte per clock. Only `rep movs` and `rep stos` have optimized microcode on modern CPUs that operates up to 64 bytes at a time. You can make strlen go that fast for long strings with SIMD vectors (SSE2 / AVX2 / AVX512BW, see implementations in glibc). But your loop can only check 1 byte per 2 cycles on Intel before Haswell, because your loop has *2* jumps, one taken and one not-taken per iteration. See [Why are loops always compiled into "do...while" style (tail jump)?](//stackoverflow.com/a/47790760) – Peter Cordes Feb 25 '19 at 22:46
  • **Also note that you can't properly test a function with the same name as a standard library function, unless you use `-fno-builtin-strlen`**. Otherwise GCC is free to optimize away `strlen("abc")` to a constant `3`. Much easier to just call it `my_strlen` because then you can compare it against library / builtin strlen. – Peter Cordes Feb 25 '19 at 22:47

1 Answer


It's just bad luck.

GCC 8, without optimisations, uses rax as an intermediate location both to move argv[1] into bob and to move the latter into the first-parameter register for strlen:

  push rbp
  mov rbp, rsp
  sub rsp, 32

  mov DWORD PTR [rbp-20], edi             ;argc
  mov QWORD PTR [rbp-32], rsi             ;argv

  mov rax, QWORD PTR [rbp-32]             ;argv
  mov rax, QWORD PTR [rax+8]              ;argv[1]
  mov QWORD PTR [rbp-8], rax              ;bob = argv[1]

  mov rax, QWORD PTR [rbp-8]
  mov rdi, rax
  call strlen                             ;strlen(bob)

  mov esi, eax
  mov edi, OFFSET FLAT:.LC0
  mov eax, 0
  call printf

  mov eax, 0
  leave
  ret

This is just bad luck; it is not documented behaviour. In fact, it fails if you use a string literal:

printf("%i\n", strlen("bob"));

  mov edi, OFFSET FLAT:.LC1
  call strlen                     ;No RAX here

  mov esi, eax
  mov edi, OFFSET FLAT:.LC0
  mov eax, 0
  call printf

The document specifying how parameters are passed to functions is your OS's ABI; read more in this answer.


GCC generates "dumb" code that uses the registers a lot when optimisations are disabled; this eases debugging (both of the GCC engine and of the compiled program) and essentially mimics a beginner's approach: first the variable is read from memory and put in the first free register (one problem solved), then it is copied into the right register (another one gone), and finally the call is made.
GCC just picked the first free register; in this simple program there is no register pressure, so rax is always the one picked.

answered by Margaret Bloom
  • Does gcc have any "official" way of requesting that it perform things like redundant-register-move optimizations but retain the same "high-level assembler" semantics associated with `-O0`? I know that it might be possible to achieve such semantics with today's gcc by explicitly disabling each and every optimization that might break such semantics, but that would only work until the next time gcc adds a breaking optimization not on the list. – supercat Feb 25 '19 at 16:04
  • @supercat I have found nothing looking at the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-8.3.0/gcc/Optimize-Options.html#Optimize-Options) but to be sure one should look at each option individually. I've tried a few flags with a `no` prefix but I don't think GCC recognises them as a request to disable an optimisation. A more conservative approach would be to only enable the optimisations known not to break the wanted behaviour. Though, future versions may change this. – Margaret Bloom Feb 25 '19 at 16:37
  • I figured that was probably still the case, but perhaps someday the authors of gcc will get around to allowing it to generate better code than 1990s-era compilers while supporting the same "popular extensions". – supercat Feb 25 '19 at 17:11
  • @supercat: I'm not sure why any of that would be useful. The x86-64 System V ABI passes args in registers, so you don't have to use crappy workarounds to get args in regs. Or in 32-bit mode, `__attribute__((regparm(3))) int foo(int,int);` will use EAX and ECX to pass args. Further customization of the calling convention beyond `-fcall-used-ebx` or `-fcall-saved-edx` might be interesting, but I don't see how *redundant* `mov` to pass args in both RDI and RAX would ever be better than just including a `mov eax, edi` in the callee. – Peter Cordes Feb 25 '19 at 22:42
  • @PeterCordes: When used with `-O0`, gcc seems to support the "popular extensions" common to 1990s compilers, including `volatile` semantics sufficient to implement a mutex on a single-core system with interrupts. Unfortunately, it generates code that contains a lot of redundant register moves as well as loads and stores of objects whose address is never taken. When used with any optimization setting other than `-O0`, however, the only way I can find to make gcc honor such guarantees is to individually disable every optimization that would violate them. – supercat Feb 25 '19 at 23:09
  • @supercat: What exact behaviour are you referring to as a "popular extension"? I haven't used 1990s compilers, but it sounds like you're talking about choosing to compile `mut++` to `add dword [mut], 1` as one instruction, giving you uniprocessor atomicity vs. interrupts. Except gcc doesn't actually do that at `-O0`, only with optimization enabled (when tuning for a machine where that's good, not `-march=pentium`). https://godbolt.org/z/JcyD0F And it doesn't do it for `volatile int xv`, only for non-`volatile`: its optimizer won't fold volatile load+store into one instruction. – Peter Cordes Feb 25 '19 at 23:22
  • @supercat: GCC doesn't officially support any extensions that allow efficient uniprocessor atomics (without a `lock` prefix). For GCC, supported == documented in the gcc manual. Happens-to-work behaviour is just a coincidence and can't be relied on. – Peter Cordes Feb 25 '19 at 23:26
  • @PeterCordes: I'm not looking for anything nearly so exotic. More along the lines of ensuring that if code writes some data in a buffer and then writes a volatile `ready` flag, the generated code won't write the `ready` flag until it's written the data into the buffer. – supercat Feb 25 '19 at 23:32
  • @supercat: ah, ok. Treating `volatile` as a release-store that can't reorder with earlier non-volatile stores. I wasn't aware it was possible to implement a mutex with only release stores, no atomic RMW. On strongly-ordered x86, that would make a `ready` flag work even in a multi-core case. Anyway, no, nothing will make `volatile` behave that way with optimization enabled, because neither ISO C nor GNU C requires that. If you want a release store to an arbitrary variable, use `__atomic_store_n( &ready, 1, __ATOMIC_RELEASE);` for `int ready;` Safe at `-O3` on all targets in GNU C. – Peter Cordes Feb 25 '19 at 23:39
  • @PeterCordes: ISO does not require such semantics because many kinds of applications don't need them, and because the Committee expected people seeking to produce compilers that are suitable for various tasks would be better placed than the Committee to recognize what special features may be necessary to effectively perform such tasks. – supercat Feb 25 '19 at 23:52
  • @PeterCordes: In any case, I repeat my question: is there any way to make gcc support the kind of semantics that 1990s compilers could support without generating code that would have been considered inefficient even by 1990s standards? – supercat Feb 25 '19 at 23:54
  • @supercat: Not from source that uses plain `volatile`. But if you can modify the source, then yes, you can easily use `asm("" ::: "memory");` as a barrier against compile-time reordering, or use `__atomic_store_n( &ready, 1, __ATOMIC_RELEASE);` to do a release-store. (Which on x86 is just a plain store while disallowing reordering with earlier atomic and non-atomic / non-volatile stores. It's totally sufficient for a data-is-ready flag for a buffer.) There are source-level ways to describe exactly what you want without needing to overload `volatile` for it; GCC just didn't choose to make `volatile` provide that. – Peter Cordes Feb 26 '19 at 03:08
  • @PeterCordes: One of the measures of quality for general-purpose compilers has generally been the ability to efficiently process a wide range of existing programs. I wonder why the authors of gcc seem so keen to eschew compatibility except in `-O0` mode which yields really terrible results unless one uses `register` storage classes, and even then is still not very efficient? – supercat Feb 26 '19 at 06:09
  • @supercat: Supporting every happens-to-work "feature" of old compilers with weak optimizers doesn't seem like a good path forward for maintainability of GCC's *own* codebase. Multithreading before C11 and C++11 was a patchwork of poorly-defined stuff outside of projects like the Linux kernel that used GNU C inline asm and `volatile` to get the exact behaviour they wanted. I don't see a lot of benefit in keeping around old bad ways of writing multithreaded code. I'll buy your argument for stuff like signed overflow (`-fwrapv`) and other programmer hostility, but not for this case. – Peter Cordes Feb 26 '19 at 06:24
  • @PeterCordes: When C was invented, it didn't include directives to block reordering because its semantics were defined even in their absence. The fact things worked wasn't happenstance. If the Standard had included a compiler-reordering barrier, or would even have a standard way by which compilers could provide one without also having to supply broken atomic primitives that they cannot meaningfully support, it might be favorable to migrate toward them, but as yet that hasn't happened. What I'm asking for are the proper semantics for interrupt-based or DMA-based I/O on single-core platforms. – supercat Feb 26 '19 at 07:43
  • @supercat: Ok then yes, in GNU C you'd use `asm("" ::: "memory");` before a `volatile` store to an MMIO register that initiates a DMA read. Or you could equally use `__atomic_store_n( dma_initiate_port, buf_address, __ATOMIC_RELEASE);` for a `volatile` pointer. I think C11 `atomic_signal_fence(memory_order_release)` should also be sufficient as a memory barrier that works on the current core (wrt. interrupts/signal handlers). – Peter Cordes Feb 26 '19 at 08:08
  • What was the purpose of having the Standard include `volatile`, if not to allow programmers to achieve the same semantics as had been available in the days before optimization without compiler-specific syntax? The behavior of `volatile` is Implementation-Defined, which would suggest that the authors of the Standard recognized that many implementations should do something beyond the bare minimum required by the Standard, and I find implausible the notion that the authors of the Standard intended to require the use of compiler-specific syntax to achieve traditional semantics. – supercat Feb 26 '19 at 08:24
  • @PeterCordes: PS--What should a quality implementation be expected to do if its execution environment would be unable to support e.g. an atomic 64-bit increment with any semantics better than "blindly perform the increment if no contention *from things the compiler knows about*, otherwise deadlock"? An inability to support a usable 64-bit increment shouldn't prevent an implementation from providing a basic reordering barrier, but I know of no Standard-defined means by which an implementation can support the latter without claiming to support the former. – supercat Feb 26 '19 at 17:44
  • @PeterCordes: Actually, looking through N1570 7.17, fences are only meaningful with regard to objects declared as atomic. So the Standard still fails to suggest any means by which programs can use mutex constructs to guard ordinary objects. – supercat Feb 26 '19 at 18:05
  • @supercat: release-store/acquire-load establishes a "synchronizes-with" relationship, allowing another thread to safely read a non-atomic buffer after doing an acquire load that sees `ready==1`. (Because the non-atomic assignments to the buffer in the producer thread are ordered before the release store.) Maybe this doesn't work with barriers, but it works with `memory_order_release` stores. `atomic_flag` is guaranteed lock-free, regardless of whether any of `_Atomic int` or whatever are atomic, so it's usable for this even if `_Atomic int64_t` uses a lock. – Peter Cordes Feb 26 '19 at 18:27
  • @supercat: In C++11, atomics are not optional. In C11 they are optional, but presumably the standard authors decided that the chance of a platform existing where `atomic_flag` atomic RMW couldn't be implemented, but where release/acquire semantics could be, was not worth worrying about. Or could be left to implementation-specific extensions. – Peter Cordes Feb 26 '19 at 18:29
  • @PeterCordes: For a lock-based atomics to work meaningfully, everything that might try to use them must coordinate their use. If a platform's ABI doesn't define a means of coordination, freestanding compilers targeting that platform won't have any way to coordinate their use of locks with anything else that might use the same objects. On the flip side, I find it rather hard to imagine any execution platform that would be unable to produce the behavior that would be required when calling an unknown function that doesn't happen to change the bit patterns in any storage. – supercat Feb 26 '19 at 18:50