printf gets stuck in an infinite loop with AL = 10 on x86-64 Linux with older gcc

Question

Very simple introductory assembly code.
It seems to compile OK with gcc -o prog1 prog1.s, but ./prog1 then just moves to a new line and shows nothing, as if waiting for input that the code never asks for. What's wrong?
Using gcc (Debian 4.7.2-5) 4.7.2 on 64-bit gNewSense running in VMware. Code:

/*
int nums[] = {10, -21, -30, 45};
int main() {
  int i, *p;
  for (i = 0, p = nums; i != 4; i++, p++)
    printf("%d\n", *p);
  return 0;
}
*/

.data
nums:  .int  10, -21, -30, 45
Sf:  .string "%d\n"    # format string for printf

.text
.globl  main
main:

/********************************************************/
/* keep this section here and do not touch it - prologue !!! */
  pushq   %rbp
  movq    %rsp, %rbp
  subq    $16, %rsp
  movq    %rbx, -8(%rbp)
  movq    %r12, -16(%rbp)
/********************************************************/

  movl  $0, %ebx  /* ebx = 0; */
  movq  $nums, %r12  /* r12 = &nums */

L1:
  cmpl  $4, %ebx  /* if (ebx == 4) ? */
  je  L2          /* goto L2 */

  movl  (%r12), %eax    /* eax = *r12 */

/*************************************************************/
/* this section prints the value of %eax (clobbers %eax) */
  movq    $Sf, %rdi    /* first parameter (pointer) */
  movl    %eax, %esi   /* second parameter (integer) */
  call  printf       /* calls the library function */
/*************************************************************/

  addl  $1, %ebx  /* ebx += 1; */
  addq  $4, %r12  /* r12 += 4; */
  jmp  L1         /* goto L1; */

L2:  
/***************************************************************/
/* keep this section here and do not touch it - epilogue!!!! */
  movq  $0, %rax  /* rax = 0  (return value) */
  movq  -8(%rbp), %rbx
  movq  -16(%rbp), %r12
  leave
  ret      
/***************************************************************/

It would make things a great deal easier if you could translate the comments to English and explain what sort of output you expect (I suppose the same output as the C program you listed above). — fuz, May 05 '20 at 20:36
For me, it works like the C code in your comment does. Are you sure you're compiling and running what you think you are? — Joseph Sible-Reinstate Monica, May 05 '20 at 20:38
@fuz You edited right. The portuguese comments are basic explanations/don't change this. — Ajna, May 05 '20 at 20:45
@JosephSible-ReinstateMonica Yes I am, as the commands indicate. — Ajna, May 05 '20 at 20:47
@Ajna You should double-check that. As it stands, your problem isn't reproducible. — Joseph Sible-Reinstate Monica, May 05 '20 at 20:49
@JosephSible-ReinstateMonica I've been cross-checking that for hours. That's how I finally gave up and came here. – Ajna, May 05 '20 at 20:52
If you compile and run your C code, does it work the way you expect? If not, then that points to some problem with your system. — Joseph Sible-Reinstate Monica, May 05 '20 at 21:00
Yes, I ran C on it the entire month; this problem only happened right now with assembly. – Ajna, May 05 '20 at 21:02
You should zero `%al` before `call printf` as you don't use any SSE registers for arguments. Still, that is unlikely to cause this problem. You could try running the program through `strace` or of course use a debugger. — Jester, May 05 '20 at 21:21
@Jester after `gcc -Wall -g prog1.s`, `gdb a.out`, `layout next`, `run` + ^C: `0x00007ffff7a9e1d0 jmpq *%rax` highlighted. In regular terminal: `Program received signal SIGINT, Interrupt. 0x00007ffff7a9e1d0 in printf () from /lib/x86_64-linux-gnu/libc.so.6` Now what? — Ajna, May 05 '20 at 22:29
That is very interesting. What is `p/a $rax`? If that points back to itself for whatever reason, then it would be an endless loop. — Jester, May 05 '20 at 22:33
An infinite loop is precisely what I suspect. Sorry, I don't know what you mean by `p/a`, but `%rax` is where the '0' return value of the `main` function is stored. If `$rax` refers to the memory address associated with it, I SUPPOSE it's the one mentioned above. Btw, I ran other, slightly different assembly code and it all works fine with the new one. – Ajna, May 05 '20 at 22:50
I meant: in gdb, when you are stopped at the `jmpq`, do a `p/a $rax` to see the value. – Jester, May 05 '20 at 22:53
`Program received signal SIGINT, Interrupt. 0x00007ffff7a9e1d0 in printf () from /lib/x86_64-linux-gnu/libc.so.6`, then `(gdb) p/a $rax` prints `$1 = 0x7ffff7a9e1ca`. – Ajna, May 05 '20 at 23:02
Ahha yeah, that's pointing to just before the jmp so it's an endless loop. Very strange. — Jester, May 05 '20 at 23:06
Yeah... and it just ran smooth and peachy in onlineGDB right now. Guess we have a strange OS or VM thing here. Not my thing at the moment, but thank you very much for the input anyhow. I learned some things indirectly. – Ajna, May 05 '20 at 23:37
Wait, I just tried it in a gNewSense 4 VM, and I can reproduce the problem there. I may just be able to figure this out after all. — Joseph Sible-Reinstate Monica, May 05 '20 at 23:40
@Jester was right about needing to zero `%al`. Do that and it works. Full answer and explanation coming shortly. — Joseph Sible-Reinstate Monica, May 06 '20 at 00:03

Joseph Sible-Reinstate Monica · Accepted Answer · 2023-01-10T06:55:18.333

tl;dr: do xorl %eax, %eax before call printf.

printf is a varargs function. Here's what the System V AMD64 ABI has to say about varargs functions:

For calls that may call functions that use varargs or stdargs (prototype-less calls or calls to functions containing ellipsis (. . . ) in the declaration) %al¹⁸ is used as hidden argument to specify the number of vector registers used. The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

You broke that rule. The first time your code calls printf, %al is 10 (your movl (%r12), %eax has just loaded nums[0] = 10 into %eax, and nothing resets it before the call), which is more than the upper bound of 8. On your gNewSense system, here's a disassembly of the beginning of printf:

printf:
   sub    $0xd8,%rsp
   movzbl %al,%eax                # rax = al;
   mov    %rdx,0x30(%rsp)
   lea    0x0(,%rax,4),%rdx       # rdx = rax * 4;
   lea    after_movaps(%rip),%rax # rax = &&after_movaps;
   mov    %rsi,0x28(%rsp)
   mov    %rcx,0x38(%rsp)
   mov    %rdi,%rsi
   sub    %rdx,%rax               # rax -= rdx;
   lea    0xcf(%rsp),%rdx
   mov    %r8,0x40(%rsp)
   mov    %r9,0x48(%rsp)
   jmpq   *%rax                   # goto *rax;
   movaps %xmm7,-0xf(%rdx)
   movaps %xmm6,-0x1f(%rdx)
   movaps %xmm5,-0x2f(%rdx)
   movaps %xmm4,-0x3f(%rdx)
   movaps %xmm3,-0x4f(%rdx)
   movaps %xmm2,-0x5f(%rdx)
   movaps %xmm1,-0x6f(%rdx)
   movaps %xmm0,-0x7f(%rdx)
after_movaps:
   # nothing past here is relevant for your problem

A quasi-C translation of the important bits is goto *(&&after_movaps - al * 4); (see Labels as Values). For efficiency, gcc and/or glibc didn't want to save more vector registers than you used, and also didn't want to do a bunch of conditional branches. Each instruction to save a vector register is 4 bytes, so the code takes the address just past the vector-register-saving instructions, subtracts al * 4 bytes, and jumps there. This results in just enough of the instructions executing. Since your al was 10, the code jumped back 40 bytes instead of at most 8 * 4 = 32, landing before the jump instruction it had just taken and thus creating an infinite loop.

As for why it's not reproducible on modern systems, here's a disassembly of the beginning of their printf:

printf:
   sub    $0xd8,%rsp
   mov    %rdi,%r10
   mov    %rsi,0x28(%rsp)
   mov    %rdx,0x30(%rsp)
   mov    %rcx,0x38(%rsp)
   mov    %r8,0x40(%rsp)
   mov    %r9,0x48(%rsp)
   test   %al,%al          # if(!al)
   je     after_movaps     # goto after_movaps;
   movaps %xmm0,0x50(%rsp)
   movaps %xmm1,0x60(%rsp)
   movaps %xmm2,0x70(%rsp)
   movaps %xmm3,0x80(%rsp)
   movaps %xmm4,0x90(%rsp)
   movaps %xmm5,0xa0(%rsp)
   movaps %xmm6,0xb0(%rsp)
   movaps %xmm7,0xc0(%rsp)
after_movaps:
   # nothing past here is relevant for your problem

A quasi-C translation of the important bits is if(!al) goto after_movaps;. Why did this change? ~~My guess is Spectre. The mitigations for Spectre make indirect jumps really slow, so it's no longer worth doing that trick.~~ Or not; see comments. Instead, they do a much simpler check: if there are any vector registers in use, save them all. With this code, your bad value of al isn't a disaster, since it just means the vector registers will be unnecessarily copied.

edited Jan 10 '23 at 06:55

answered May 06 '20 at 00:33

Joseph Sible-Reinstate Monica

*The mitigations for Spectre make indirect jumps really slow* - only slow if you armor them with `lfence` or something, which GCC doesn't do in general by default. I think this change predated Spectre; probably just because indirect branches are harder to predict, and FP printf is rare enough that dumping extra registers when you have one FP arg doesn't have much cost. (Especially on modern CPUs with good OoO exec and large store buffers.) Interesting discovery; I didn't know gcc variadic code-gen ever did anything other than check `AL!=0`. – Peter Cordes May 06 '20 at 05:18
Another effect of this is that a bogus AL can't crash by jumping too far. So it's more robust against buggy hand-written code. IDK if that was any motivation at all. It also saves instructions in the no-FP fast path, just `test %al,%al` / `jz` instead of multiple ALU instructions to calculate a jump target. Seems like a good change to me regardless of Spectre. – Peter Cordes May 06 '20 at 05:21
The TL;DR line did indeed work. An interesting follow-up: in the slightly different program https://onlinegdb.com/r1Yd5py9I, when the value to be printed is greater than 8 (by adding 5 to any of the summed values), it fails with `invalid operation` instead of an infinite loop. I wonder why. – Ajna May 06 '20 at 05:34
@Ajna Since the problem is it's jumping wildly, with values other than 10, it's probably ending up jumping to halfway inside of some instruction that doesn't happen to be some other valid instruction, and is thus getting `SIGILL` Illegal Instruction. – Joseph Sible-Reinstate Monica May 06 '20 at 05:36
So, to wrap it up, we have a gNewSense issue here? Because in onlineGDB and in my colleagues/teacher Fedora it works just fine. – Ajna May 06 '20 at 05:40
@Ajna No, it's not an issue with gNewSense. It was an issue with your code. Your code broke one of the rules of the ABI, and it just so happens that newer systems are more lenient about the rule you broke than older ones are (i.e., on newer systems it's just slightly slower instead of completely broken). – Joseph Sible-Reinstate Monica May 06 '20 at 05:42
@Ajna: It's not rare for buggy asm code to work by accident / happen to work. Other ABI violations like modifying a call-preserved register also often don't cause a problem with simple callers, but will break other code. Throwing code at the wall and seeing what sticks works even less well in asm than in other languages. Don't depend on trial and error. (Although it can find things that definitely *don't* work, e.g. like here where it breaks on one test system.) – Peter Cordes May 06 '20 at 10:39
@PeterCordes I don't believe code from a top 3 national and top 1 private computer science college would be throwing code at the wall or depending on trial and error, but ok, noted. – Ajna May 06 '20 at 14:13
@Ajna: Is that where the ABI-violating code in the question was from? You didn't say that until now, but I guess that explains why you kept thinking it must be a bug in gNewSense even after the bug in that code was explained. Bugs do happen by accident even when you know what you're doing and just forget something. For the same reasons intentional trial and error is unsafe, it's easy to miss such bugs when testing on systems where it happens to work. Often a good idea to start with or compare against C compiler output; compilers don't make mistakes in following the calling convention. – Peter Cordes May 06 '20 at 14:31
@PeterCordes fun fact: the code in the exercise following this one has a new 'movl $0, %eax' right before call printf :P – Ajna May 09 '20 at 23:35
@JosephSible-ReinstateMonica: Re: efficiency advantages of the test/jz way over the computed-jump way: I wrote a big footnote about that in an answer to [Why does printf still work with RAX lower than the number of FP args in XMM registers?](https://stackoverflow.com/a/71985420). Not exactly a duplicate, but the answer basically has to explain the same details. – Peter Cordes Apr 24 '22 at 04:04
