
To watch a process being created when I type 'ls' in a terminal, I set a breakpoint at copy_thread in arch/x86/kernel/process.c with gdb and printed the values of pt_regs:

{bx = 0x1200011, cx = 0x0, dx = 0x0, si = 0x0, di = 0xa0f38e8, bp = 0x8266000,
  ax = 0xffffffda, ds = 0x7b, __dsh = 0x0, es = 0x7b, __esh = 0x0, fs = 0x0, __fsh = 0x0,
  gs = 0x33, __gsh = 0x0, orig_ax = 0x78, ip = 0xb7f29549, cs = 0x73, __csh = 0x0, flags = 0x206,
  sp = 0xbfab35f0, ss = 0x7b, __ssh = 0x0}

The bp of pt_regs is 0x8266000 and the sp of pt_regs is 0xbfab35f0. I have found the places where they are assigned. The sp of pt_regs is assigned in do_SYSENTER_32 in arch/x86/entry/common.c:

__visible noinstr long do_SYSENTER_32(struct pt_regs *regs)
{
    /* SYSENTER loses RSP, but the vDSO saved it in RBP. */
    regs->sp = regs->bp;

    /* SYSENTER clobbers EFLAGS.IF.  Assume it was set in usermode. */
    regs->flags |= X86_EFLAGS_IF;

    return do_fast_syscall_32(regs);
}

The bp of pt_regs is assigned in __do_fast_syscall_32 via get_user, so its value appears to come from user space:

static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
{
    // do other stuff...

    /* Fetch EBP from where the vDSO stashed it. */
    if (IS_ENABLED(CONFIG_X86_64)) {
        /*
         * Micro-optimization: the pointer we're following is
         * explicitly 32 bits, so it can't be out of range.
         */
        res = __get_user(*(u32 *)&regs->bp,
             (u32 __user __force *)(unsigned long)(u32)regs->sp);
    } else {
        res = get_user(*(u32 *)&regs->bp,
               (u32 __user __force *)(unsigned long)(u32)regs->sp);
    }

    // do other stuff...
    return true;
}

The backtrace shows the order of the function calls:

#0  copy_thread (clone_flags=clone_flags@entry=18874368, sp=0, arg=0, p=0xc31c0a00, tls=0)
    at arch/x86/kernel/process.c:133
#1  0xc1058722 in copy_process (pid=pid@entry=0x0, trace=trace@entry=0, node=node@entry=-1, 
    args=<optimized out>) at kernel/fork.c:2122
#2  0xc10593cc in kernel_clone (args=args@entry=0xc68e9f38) at kernel/fork.c:2500
#3  0xc1059807 in __do_sys_clone (child_tidptr=0xa0f38e8, tls=0, parent_tidptr=0x0, newsp=0, 
    clone_flags=<optimized out>) at kernel/fork.c:2617
#4  __se_sys_clone (child_tidptr=168769768, tls=0, parent_tidptr=0, newsp=0, 
    clone_flags=<optimized out>) at kernel/fork.c:2585
#5  __ia32_sys_clone (regs=<optimized out>) at kernel/fork.c:2585
#6  0xc1b04b85 in do_syscall_32_irqs_on (nr=<optimized out>, regs=0xc68e9fb4)
    at arch/x86/entry/common.c:77
#7  __do_fast_syscall_32 (regs=regs@entry=0xc68e9fb4) at arch/x86/entry/common.c:140
#8  0xc1b04c29 in do_fast_syscall_32 (regs=0xc68e9fb4) at arch/x86/entry/common.c:165
#9  0xc1b04c75 in do_SYSENTER_32 (regs=<optimized out>) at arch/x86/entry/common.c:208
#10 0xc1b0e32f in entry_SYSENTER_32 () at arch/x86/entry/entry_32.S:952
#11 0x01200011 in ?? ()
#12 0x00000000 in ?? ()

My questions are:

  • Why do the ebp and esp stored in pt_regs differ so greatly?
  • Why is the value of ebp stored in pt_regs smaller than the value of
    esp stored in pt_regs, since the stack grows downward?

I used a debuggable build of linux-5.12.10, and the 'ls' command comes from busybox.

sunhang
  • Do you have add_random_kstack_offset enabled? – stark Jul 10 '23 at 10:52
  • @stark no, there is no 'add_random_kstack_offset' in common.c from linux-5.12.10 – sunhang Jul 10 '23 at 12:03
  • `regs->sp` points to the current position on the user stack. After the `get_user()` macro call in `__do_fast_syscall_32()`, `regs->bp` will contain the *arg6* value that was passed on the user stack for the fast system call. Most system calls do not use *arg6* so it can be considered to be a junk value for those system calls. For example, the `clone` system call on IA32 is not defined to use *arg6*. – Ian Abbott Jul 11 '23 at 12:14
  • For comparison, the legacy INT $0x80 system call mechanism passes the *arg6* value in the EBP register directly (not on the user stack), but it still ends up in `regs->bp`. – Ian Abbott Jul 11 '23 at 12:32

2 Answers


Consider the difference in register and stack usage for the legacy INT $0x80 system call mechanism and the modern fast system call mechanism for IA32:

Register / stack     Legacy system call     Fast system call
eax                  system call number     system call number
ebx                  arg1                   arg1
ecx                  arg2                   arg2
edx                  arg3                   arg3
esi                  arg4                   arg4
edi                  arg5                   arg5
ebp                  arg6                   user stack pointer
arg on user stack    -                      arg6

For the fast system call mechanism, when entry_SYSENTER_32 constructs the struct pt_regs entry on the kernel stack, the sp member does not yet hold the user stack pointer (SYSENTER does not preserve it), and the bp member holds the user stack pointer that the vDSO copied into EBP. Therefore, the fast system call mechanism fixes up the sp and bp members for compatibility with the legacy system call mechanism. The sp member value is corrected in do_SYSENTER_32():

    /* SYSENTER loses RSP, but the vDSO saved it in RBP. */
    regs->sp = regs->bp;

The bp member value is corrected in __do_fast_syscall_32(), setting it to the arg6 value from the user stack:

    /* Fetch EBP from where the vDSO stashed it. */
    if (IS_ENABLED(CONFIG_X86_64)) {
        /*
         * Micro-optimization: the pointer we're following is
         * explicitly 32 bits, so it can't be out of range.
         */
        res = __get_user(*(u32 *)&regs->bp,
             (u32 __user __force *)(unsigned long)(u32)regs->sp);
    } else {
        res = get_user(*(u32 *)&regs->bp,
               (u32 __user __force *)(unsigned long)(u32)regs->sp);
    }

When do_syscall_32_irqs_on() is called from do_int80_syscall_32() (for the legacy system call mechanism) or from __do_fast_syscall_32() (for the fast system call mechanism), the regs->bp and regs->sp values will be as expected no matter which of the system call mechanisms was used.
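
The two fix-ups can be modeled in user space. The following is only a sketch: `struct fake_regs`, `struct fake_user_stack`, `fake_get_user()` and `fixup_sysenter_regs()` are hypothetical stand-ins invented for illustration, not kernel APIs, and the "user stack" is an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down stand-in for struct pt_regs (illustration only). */
struct fake_regs {
    uint32_t bp;
    uint32_t sp;
};

/* Toy "user stack": base is the user address of words[0]. */
struct fake_user_stack {
    uint32_t base;
    const uint32_t *words;
    size_t nwords;
};

/* Stand-in for get_user(): read one aligned 32-bit word at a user address. */
int fake_get_user(uint32_t *dst, const struct fake_user_stack *us, uint32_t addr)
{
    if (addr < us->base || (addr - us->base) % 4 != 0)
        return -1;
    size_t idx = (addr - us->base) / 4;
    if (idx >= us->nwords)
        return -1;
    *dst = us->words[idx];
    return 0;
}

/* Mirrors the two fix-ups: sp is taken from bp, then bp is reloaded from *sp. */
int fixup_sysenter_regs(struct fake_regs *regs, const struct fake_user_stack *us)
{
    assert(regs != NULL && us != NULL);
    regs->sp = regs->bp;                            /* as in do_SYSENTER_32() */
    return fake_get_user(&regs->bp, us, regs->sp);  /* as in __do_fast_syscall_32() */
}
```

Fed with the numbers from the question (bp = 0xbfab35f0 on entry and the word 0x8266000 stored at that user address), the sketch ends with sp = 0xbfab35f0 and bp = 0x8266000. In this model, bp is simply whatever the caller had in EBP before the vDSO pushed it, which is why it need not be anywhere near sp.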


Another fix-up for fast system calls occurs for regs->ip. The original value of the EIP register is lost by the sysenter instruction, which is normally executed from the __kernel_vsyscall() function in the vDSO. regs->ip is corrected in do_fast_syscall_32():

    /*
     * Called using the internal vDSO SYSENTER/SYSCALL32 calling
     * convention.  Adjust regs so it looks like we entered using int80.
     */
    unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
                    vdso_image_32.sym_int80_landing_pad;

    /*
     * SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
     * so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
     * Fix it up.
     */
    regs->ip = landing_pad;

The vDSO contains an int $0x80 instruction immediately after the sysenter instruction. The landing_pad value is the address just after that int $0x80 instruction, so that instruction will not be reached when returning from the fast system call.

The reason for the int $0x80 instruction in the vDSO is to support older CPUs that lack the sysenter and sysexit instructions. In that case, the mov %esp, %ebp; sysenter instruction sequence in __kernel_vsyscall() in the vDSO will be replaced with nop instructions and the CPU will reach the int $0x80 instruction that immediately follows that instruction sequence, effectively changing the fast system call into a legacy system call for older CPUs. That legacy system call will return to the point just after the int $0x80 instruction just like the fast system call.
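
The relationship between sysenter, int $0x80 and the landing pad can be checked on a byte array. This is a sketch under an assumption: the byte layout below is what a typical disassembly of the 32-bit __kernel_vsyscall() looks like when the SYSENTER sequence is present, and the real vDSO on a given kernel may differ. `find_landing_pad()` and `restart_hits_int80()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout of __kernel_vsyscall() with the SYSENTER sequence present. */
const uint8_t vsyscall_bytes[] = {
    0x51,        /* push %ecx                                          */
    0x52,        /* push %edx                                          */
    0x55,        /* push %ebp      (stashes arg6 on the user stack)    */
    0x89, 0xe5,  /* mov  %esp,%ebp (user stack pointer goes into EBP)  */
    0x0f, 0x34,  /* sysenter                                           */
    0xcd, 0x80,  /* int  $0x80                                         */
    0x5d,        /* pop  %ebp      <- landing pad                      */
    0x5a,        /* pop  %edx                                          */
    0x59,        /* pop  %ecx                                          */
    0xc3,        /* ret                                                */
};

/* Offset of the first byte after the first "cd 80" pair, or -1 if absent. */
long find_landing_pad(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    for (size_t i = 0; i + 1 < len; i++) {
        if (code[i] == 0xcd && code[i + 1] == 0x80)
            return (long)(i + 2);
    }
    return -1;
}

/* Nonzero if ip - 2 points at an int $0x80, so a restart re-executes it. */
int restart_hits_int80(const uint8_t *code, size_t len, long ip)
{
    if (ip < 2 || (size_t)ip > len)
        return 0;
    return code[ip - 2] == 0xcd && code[ip - 1] == 0x80;
}
```

Under the assumed layout the landing pad is at offset 9, the pop %ebp that restores the caller's EBP. Backing up two bytes from there reaches the int $0x80 rather than the sysenter, which is the property that the 'regs->ip -= 2' comment relies on.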

Ian Abbott

Eighteen days later, I repeated the experiment and noticed something strange. I compiled a program on a 32-bit Ubuntu system. Simplified, the program is:

#include <unistd.h>

int main(int argc, char* argv[]) {
    fork();
    return 0;
}

I placed the compiled a.out in my rootfs.img.gz and started QEMU:

qemu-system-i386 -m 256m -kernel ./bzImage -initrd ./rootfs.img.gz -append "root=/dev/ram init=/linuxrc nokaslr" -serial file:output.txt -s -S

Then, in gdb, I set a breakpoint with break copy_thread.

I entered the command ./a.out in the shell of the Linux guest running in QEMU.

Because copy_thread contains the statement *childregs = *current_pt_regs(), I can inspect the saved user-mode registers, including the user stack pointer, by printing childregs.
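
As a rough model of why childregs shows the parent's user-mode registers, the copy can be sketched in plain C. This is a simplification with assumptions: `struct mini_regs` keeps only a few fields, `mini_copy_thread()` is a hypothetical name, and the ax and newsp handling reflects my reading of copy_thread rather than anything quoted in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down stand-in for struct pt_regs; only a few fields are modeled. */
struct mini_regs {
    uint32_t bx, bp, ax, ip, sp;
};

/*
 * Sketch of the register copy: the child starts from a copy of the parent's
 * saved user registers, sees 0 as the system call return value, and gets a
 * new stack pointer only if the caller supplied one (clone()'s newsp).
 */
void mini_copy_thread(struct mini_regs *childregs,
                      const struct mini_regs *parent_regs,
                      uint32_t newsp)
{
    assert(childregs != NULL && parent_regs != NULL);
    *childregs = *parent_regs;  /* like *childregs = *current_pt_regs() */
    childregs->ax = 0;          /* assumed: the child returns 0 */
    if (newsp)
        childregs->sp = newsp;  /* assumed: only when newsp is non-zero */
}
```

With newsp = 0, as in the backtrace in the question, bp and sp in the child are exact copies of the parent's saved values, so printing childregs right after the copy shows the caller's user-mode state.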

When the shell created the a.out process, the kernel stopped at copy_thread. At that point I entered p/x *childregs:

(gdb) p/x *childregs
$10 = {bx = 0x1200011, cx = 0x0, dx = 0x0, si = 0x0, di = 0x9acb3e8, 
  bp = 0x8289000, ax = 0xffffffda, ds = 0x7b, __dsh = 0x0, es = 0x7b, 
  __esh = 0x0, fs = 0x0, __fsh = 0x0, gs = 0x33, __gsh = 0x0, orig_ax = 0x78, 
  ip = 0xb7f93549, cs = 0x73, __csh = 0x0, flags = 0x216, sp = 0xbfe797ec, 
  ss = 0x7b, __ssh = 0x0}

The shell's saved registers are bp = 0x8289000 and sp = 0xbfe797ec. The bp value looks very strange.

When a.out called fork(), the kernel stopped at copy_thread again. At that point I entered p/x *childregs again:

(gdb) p/x *childregs
$11 = {bx = 0x1200011, cx = 0x0, dx = 0x0, si = 0x0, di = 0xb7eeb128, 
  bp = 0xbfa23818, ax = 0xffffffda, ds = 0x7b, __dsh = 0x0, es = 0x7b, 
  __esh = 0x0, fs = 0x0, __fsh = 0x0, gs = 0x33, __gsh = 0x0, orig_ax = 0x78, 
  ip = 0xb7ef0549, cs = 0x73, __csh = 0x0, flags = 0x246, sp = 0xbfa237d0, 
  ss = 0x7b, __ssh = 0x0}

For a.out, the saved registers are bp = 0xbfa23818 and sp = 0xbfa237d0. Here bp is larger than sp and the two values are close together, which is what I expected.

But when the shell created the a.out process, bp was 0x8289000, and I don't know what exactly happened at that time.
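
The comparison made above can be written down as a small check. `bp_looks_like_frame_pointer()` is a hypothetical heuristic introduced only for illustration, and the 1 MiB window used in the note below is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical heuristic: on a downward-growing stack, a frame pointer sits
 * at or above sp and within some plausible distance (max_frame_bytes).
 */
int bp_looks_like_frame_pointer(uint32_t bp, uint32_t sp, uint32_t max_frame_bytes)
{
    assert(max_frame_bytes > 0);
    if (bp < sp)
        return 0;
    return (bp - sp) <= max_frame_bytes;
}
```

With a 1 MiB window, the a.out values pass (0xbfa23818 - 0xbfa237d0 = 0x48, i.e. 72 bytes), while the shell's values (bp = 0x8289000, sp = 0xbfe797ec) do not. One plausible reading, in line with the comment that regs->bp is just the arg6 slot, is that EBP happened to hold a frame pointer in a.out and some unrelated value in the shell at the time of the system call; this post does not confirm how either binary was built.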

sunhang
  • The program should be compiled on 32-bit Ubuntu. The ld-linux.so.2 and libc.so.6 in the rootfs should also come from 32-bit Ubuntu. – sunhang Jul 29 '23 at 09:49