The problem lies deep in the bowels of GAS, the GNU assembler, and how it generates DWARF debug information.
The compiler, GCC, is responsible for generating a specific sequence of instructions for a position-independent thread-local access. That sequence is documented in ELF Handling for Thread-Local Storage, page 22, section 4.1.6 (x86-64 General Dynamic TLS Model), and is:
0x00 .byte 0x66
0x01 leaq x@tlsgd(%rip),%rdi
0x08 .word 0x6666
0x0a rex64
0x0b call __tls_get_addr@plt
The sequence is padded this way because the 16 bytes it occupies leave room for backend, assembler, and linker optimizations. Indeed, your compiler generates the following assembly for threadMain():
threadMain:
.LFB2:
.file 1 "thread.c"
.loc 1 14 0
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %rdi, -8(%rbp)
.loc 1 15 0
.byte 0x66
leaq obj@tlsgd(%rip), %rdi
.value 0x6666
rex64
call __tls_get_addr@PLT
movl $1, (%rax)
.loc 1 16 0
...
The linker then relaxes this code, which contains a function call (!), down to just two instructions in the final binary. These are:
- a mov having an fs: segment override, and
- a lea.
Together they occupy 16 bytes, which demonstrates why the General Dynamic Model instruction sequence is designed to require 16 bytes.
(gdb) disas/r threadMain
Dump of assembler code for function threadMain:
0x00000000004007f0 <+0>: 55 push %rbp
0x00000000004007f1 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004007f4 <+4>: 48 83 ec 10 sub $0x10,%rsp
0x00000000004007f8 <+8>: 48 89 7d f8 mov %rdi,-0x8(%rbp)
0x00000000004007fc <+12>: 64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
0x0000000000400805 <+21>: 48 8d 80 f8 ff ff ff lea -0x8(%rax),%rax
0x000000000040080c <+28>: c7 00 01 00 00 00 movl $0x1,(%rax)
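The 16-byte claim can be checked by counting bytes. The sketch below uses Python purely as a calculator. The relaxed bytes are copied from the disassembly above; the General Dynamic bytes are the usual encodings of those instructions, with the relocated 32-bit fields shown as zero placeholders (an assumption, since the object file itself is not shown here).

```python
# General Dynamic sequence as emitted by the compiler. The zero bytes in
# leaq and call are placeholders for fields filled in by relocations.
general_dynamic = [
    (".byte 0x66", bytes([0x66])),
    ("leaq x@tlsgd(%rip),%rdi", bytes([0x48, 0x8D, 0x3D, 0, 0, 0, 0])),
    (".word 0x6666", bytes([0x66, 0x66])),
    ("rex64", bytes([0x48])),
    ("call __tls_get_addr@plt", bytes([0xE8, 0, 0, 0, 0])),
]

# Relaxed sequence, copied from the gdb disas/r output.
relaxed = [
    ("mov %fs:0x0,%rax", bytes.fromhex("64488b042500000000")),
    ("lea -0x8(%rax),%rax", bytes.fromhex("488d80f8ffffff")),
]


def total_size(sequence):
    """Sum the encoded length of every entry in the sequence."""
    return sum(len(encoding) for _, encoding in sequence)


gd_size = total_size(general_dynamic)
relaxed_size = total_size(relaxed)
print(gd_size, relaxed_size)
```

Because the two totals match, swapping one sequence for the other does not move any later instruction, which is also why the line-number offsets discussed below are not disturbed by the relaxation.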
So far, everything has been done correctly. The problem now begins as GAS generates DWARF debug information for your particular assembler code.
While parsing line-by-line in binutils-x.y.z/gas/read.c, in the function void read_a_source_file (char *name), GAS encounters .loc 1 15 0, the statement that begins the next line, and runs the handler void dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED) in dwarf2dbg.c. Unfortunately, the handler does not unconditionally emit debug information for the current offset within the "fragment" (frag_now) of machine code it is currently building. It could have done so by calling dwarf2_emit_insn(0), but the .loc handler currently does that only when it sees multiple .loc directives consecutively. In our case it instead continues on to the next line, leaving the debug information unemitted.
On the next line it sees the .byte 0x66 directive of the General Dynamic sequence. This is not, in and of itself, part of an instruction, despite representing the data16 instruction prefix in x86 assembly. GAS acts upon it with the handler cons_worker(), and the fragment grows from 12 bytes to 13.
On the next line it sees a true instruction, leaq, which is parsed by calling the macro assemble_one(), which maps to void md_assemble (char *line) in gas/config/tc-i386.c. At the very end of that function, output_insn() is called, which itself finally calls dwarf2_emit_insn(0) and causes debug information to be emitted at last. A new Line Number Statement (LNS) is begun that claims that line 15 began at function-start-address plus previous fragment size. But since we passed over the .byte statement before doing so, the fragment is 1 byte too large, and the computed offset for the first instruction of line 15 is therefore 1 byte off.
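The sequence of events above can be replayed with a toy model. This is only a sketch, not GAS's actual code: the function name line_table and the "loc"/"data"/"insn" tags are made up for illustration, and the instruction sizes are read off the disassembly shown earlier.

```python
def line_table(statements):
    """Toy model of the behaviour described above: a .loc is only
    remembered, and its row is emitted when the next real instruction is
    assembled, using the fragment size at that moment."""
    frag_size = 0        # bytes placed in the current fragment so far
    pending_line = None  # line from the latest .loc, not yet emitted
    rows = []            # (offset from function start, source line)
    for kind, value in statements:
        if kind == "loc":
            pending_line = value
        elif kind == "data":      # .byte / .value: grows the fragment only
            frag_size += value
        elif kind == "insn":      # instruction: emit pending row, then grow
            if pending_line is not None:
                rows.append((frag_size, pending_line))
                pending_line = None
            frag_size += value
    return rows


thread_main = [
    ("loc", 14),
    ("insn", 1),   # pushq %rbp
    ("insn", 3),   # movq %rsp, %rbp
    ("insn", 4),   # subq $16, %rsp
    ("insn", 4),   # movq %rdi, -8(%rbp)
    ("loc", 15),
    ("data", 1),   # .byte 0x66 slips in before the row is emitted
    ("insn", 7),   # leaq obj@tlsgd(%rip), %rdi
    ("data", 2),   # .value 0x6666
    ("insn", 1),   # rex64
    ("insn", 5),   # call __tls_get_addr@PLT
]

rows = line_table(thread_main)
print(rows)
```

In this model line 15 is recorded at offset 13, although the code for line 15 really starts at offset 12, right after the 12-byte prologue.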
Some time later, the linker relaxes the General Dynamic sequence to the final instruction sequence that starts with mov %fs:0x0,%rax. The code size and all offsets remain unchanged because both sequences of instructions are 16 bytes long. The debug information is therefore unchanged, and still wrong.
GDB, when it reads the Line Number Statements, is told that the prologue of threadMain(), which is associated with line 14 (where its signature is found), ends where line 15 begins. GDB dutifully plants a breakpoint at that location, but unfortunately it is 1 byte too far.
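As a rough illustration of that choice, the sketch below picks the address of the first line-table row whose line differs from the function's first line. This is a deliberate simplification (GDB's real prologue analysis is more involved), and the helper name breakpoint_address is hypothetical. The function address comes from the disassembly; the offset of 13 is the faulty value derived above.

```python
FUNC_START = 0x4007F0  # address of threadMain in the disassembly

# (address, line) rows as the faulty debug information describes them.
rows = [(FUNC_START + 0, 14), (FUNC_START + 13, 15)]


def breakpoint_address(rows):
    """Return the address of the first row whose line differs from the
    first row's line, i.e. where the prologue is assumed to end."""
    first_line = rows[0][1]
    for address, line in rows:
        if line != first_line:
            return address
    return rows[0][0]  # single-line function: fall back to its start


bp = breakpoint_address(rows)
true_start = 0x4007FC  # where mov %fs:0x0,%rax really begins
print(hex(bp), bp - true_start)
```

The breakpoint lands inside the mov instruction, one byte past its fs: prefix.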
When run without a breakpoint, the program runs normally, and sees
64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
Correctly placing the breakpoint would involve saving the first byte of that instruction and replacing it with int3 (opcode 0xcc), leaving
cc int3
48 8b 04 25 00 00 00 00 mov (0x0),%rax
The normal step-over sequence would then involve restoring the first byte of the instruction, setting the program counter (rip on x86-64) to the address of that breakpoint, single-stepping, re-inserting the breakpoint, then continuing the program.
However, when GDB plants the breakpoint at the incorrect address 1 byte too far, the program sees instead
64 cc fs:int3
8b 04 25 00 00 00 00 <garbage>
which is a weird but still valid breakpoint. That's why you didn't see SIGILL (illegal instruction).
Now, when GDB attempts to step over, it restores the instruction byte, sets the PC to the address of the breakpoint, and this is what it sees now:
64 fs: # CPU DOESN'T SEE THIS!
48 8b 04 25 00 00 00 00 mov (0x0),%rax # <- CPU EXECUTES STARTING HERE!
# BOOM! SEGFAULT!
Because GDB restarted execution one byte too far, the CPU does not decode the fs: instruction prefix byte, and instead executes mov (0x0),%rax with the default segment, which is ds: (data). This immediately results in a read from address 0, the null pointer. The SIGSEGV promptly follows.
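The difference between the two breakpoint placements can be replayed on the raw bytes. The following is a minimal sketch, not a model of GDB internals: the helpers plant and resume_sees_fs_prefix are hypothetical names, and "decoding" is reduced to checking whether execution restarts on the 0x64 prefix byte.

```python
INT3 = 0xCC
FS_PREFIX = 0x64

# mov %fs:0x0,%rax, bytes copied from the disassembly.
original = bytes.fromhex("64488b042500000000")


def plant(code, offset):
    """Return (patched code, saved byte) with int3 written at offset."""
    patched = bytearray(code)
    saved = patched[offset]
    patched[offset] = INT3
    return bytes(patched), saved


def resume_sees_fs_prefix(code, offset):
    """After the saved byte is restored, execution restarts at offset.
    The fs: override applies only if decoding starts on the prefix byte."""
    return code[offset] == FS_PREFIX


for offset in (0, 1):  # 0 = correct address, 1 = one byte too far
    patched, saved = plant(original, offset)
    restored = bytearray(patched)
    restored[offset] = saved  # step-over: put the original byte back
    print(offset, patched.hex(), resume_sees_fs_prefix(bytes(restored), offset))
```

At offset 0 the restart begins on the prefix, so the access goes through fs:. At offset 1 the restored bytes are identical, but the restart skips the prefix, which is exactly the failing case.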
All due credits to Mark Plotnick for essentially nailing this.
The solution that was retained is to binary-patch cc1, gcc's actual C compiler, to emit data16 instead of .byte 0x66. This results in GAS parsing the prefix and instruction combination as a single unit, yielding the correct offset in the debug information.
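Why the spelling of the prefix matters can be shown with a small sketch. It assumes, as described above, that the row for a .loc is recorded when the first real instruction after it is assembled, and that after the patch the prefix and leaq are treated as one 8-byte instruction. The helper name first_row_offset and the "data"/"insn" tags are hypothetical.

```python
PROLOGUE_SIZE = 12  # bytes of threadMain before line 15, per the disassembly


def first_row_offset(statements):
    """Offset recorded for the pending .loc: the fragment size at the
    moment the first instruction (not raw data) is assembled."""
    offset = PROLOGUE_SIZE
    for kind, size in statements:
        if kind == "insn":
            return offset
        offset += size  # raw data is added before the row is recorded
    return None


before_patch = [("data", 1), ("insn", 7)]  # .byte 0x66, then leaq
after_patch = [("insn", 8)]                # data16 + leaq as one unit

print(first_row_offset(before_patch), first_row_offset(after_patch))
```

With the prefix counted as part of the instruction, the recorded offset coincides with the real start of line 15's code.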