
I read in the 3rd chapter of "Linux Kernel Development, Second Edition" by Robert Love (ISBN:0-672-32720-1) that the clone system call is used to create a thread in Linux. Now, the syntax of clone requires the address of a start routine/function to be passed to it.

But on the same page it is written that fork calls clone internally. So my question is: how does the child process created by fork start running the code that follows the fork call, i.e. why does it not require a function as a starting point?

If the links I provided have incorrect info, then please guide me to some better links/resources.

Armen Michaeli
0xF1
  • A function as parameter is just an address in memory. In assembly level you would see it can simply pop the return address from the stack and use it as target for the new thread entry point. – Havenard Sep 19 '13 at 20:29
  • The page you link to from the “this” text is the `clone` documentation, the same as the page you link to from the “syntax” text. Perhaps you meant to link to the `fork` documentation. That documentation says that `fork` calls `clone` with flags set to `SIGCHLD`. Presumably that tells `clone` to change its regular behavior and continue execution as a return from the call rather than calling a new routine. I would question whether `SIGCHLD` is correct; I would expect something more like `CLONE_CHILD`. – Eric Postpischil Sep 19 '13 at 20:34
  • @Havenard : Do you mean to say that it will save/push the address of the next instruction (which the PC will be storing) on the stack and use it after creating the child? So that means `clone()` uses the function address (passed through, say, `pthread_create()`) when creating a thread, while when creating a process it directly uses the return address from the stack. – 0xF1 Sep 19 '13 at 20:38
  • @EricPostpischil : Sorry for wrong link, I corrected that. – 0xF1 Sep 19 '13 at 20:41
  • In assembly level, when you perform a `call`, it automatically pushes to the stack the address of the instruction right after it. When the function you are calling performs a `ret`, it automatically pops this address back and jumps to it, so everything continues flowing. So when you call `fork()`, the address of the instruction where it should continue executing after this call is already in the stack by default, you only have to read and use it. – Havenard Sep 19 '13 at 20:42
  • @Havenard: That would not be a proper implementation of `fork`. Upon return from `fork`, the stack should look like a return from `fork` (identical to the stack just prior to the call), not like a subroutine call to the place from which `fork` was called. – Eric Postpischil Sep 19 '13 at 20:46
  • Eric actually, no, the stack should just be exactly the same, because the return value of a function is not put in the stack, it's in the register `EAX`. Everything else is exactly the same, including the addresses, because the process was simply cloned to a new virtual memory space. – Havenard Sep 19 '13 at 20:48
  • `void where_to() { long i; printf("my return point is 0x%08x", (&i)[1]); }` I didn't test but that would be basically how you retrieve the return address of your function in a 32 bits thread. The `call` from which your function was called is the instruction right before this address. – Havenard Sep 19 '13 at 20:53
  • @Havenard: The `clone` process normally **calls** the function passed in the `fn` parameter, and it passes it an argument. Therefore, if you try to implement `fork` by passing the return address for the `fork` call to `clone`, you will have a stack and/or registers into which have been placed a return address and a subroutine argument. That should not happen in `fork` call. Additionally, you will be calling into the middle of a routine, at a point which will not have the subroutine prologue that some platforms require. – Eric Postpischil Sep 19 '13 at 21:00
  • I see what you mean, but that's not really a problem because `fork()` doesn't take any parameters. It's a key factor here because it means there's no need to clean up the stack after performing the `call` for it, meaning you can safely start a thread at this point as long as you make sure the register `ESP` is preserved. And as you can see, `clone()` takes a `child_stack` parameter that seems to do exactly this. – Havenard Sep 19 '13 at 21:06
  • @Havenard: To implement a call to `fn`, `clone` must set up the child stack, then perform the call. So the return address and parameters are written after the stack pointer is initialized to the new stack. Additionally, the `call` semantics are wrong for making it look like a return from `fork`, since registers that were saved when `fork` started are not restored. – Eric Postpischil Sep 19 '13 at 22:00
  • It all can be fixed one way or another, doesn't really matter. – Havenard Sep 19 '13 at 22:25
  • Maybe not quite a duplicate although the question titles are almost identical the question contents are a bit different: http://stackoverflow.com/questions/18904292/is-it-true-that-fork-calls-clone-internally – Zan Lynx Oct 05 '15 at 22:51

2 Answers


For questions like this, always read the source code.

In glibc's nptl/sysdeps/unix/sysv/linux/fork.c (GitHub) (nptl = Native POSIX Thread Library) we can find the implementation of fork(), which is definitely not a syscall. The magic happens inside the ARCH_FORK macro, which is defined as an inline call to clone() in nptl/sysdeps/unix/sysv/linux/x86_64/fork.c (GitHub). But wait, no function or stack pointer is passed to this version of clone()! So, what is going on here?

Let's look at the implementation of clone() in glibc, then. It's in sysdeps/unix/sysv/linux/x86_64/clone.S (GitHub). You can see that it saves the function pointer on the child's stack, makes the clone syscall, and then the new process pops the function pointer off the stack and calls it.

So it works like this:

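/* Pseudocode for the library wrapper, not real C */
clone(int (*fn)(void *), void *stack_pointer, int flags, void *arg)
{
    push fn and arg onto stack_pointer   /* the child will find them on its new stack */
    syscall_clone(flags, stack_pointer)
    if (child) {
        pop fn and arg off of stack
        exit(fn(arg));                   /* fn's return value becomes the exit status */
    }
}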

And fork() is...

fork()
{
    ...
    syscall_clone();
    ...
}
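To make the wrapper's behaviour concrete, here is a minimal sketch of calling the glibc clone() wrapper directly. It assumes Linux with glibc on x86-64 (where the stack grows downward); the names child_entry, run_child_with_clone and CHILD_STACK_SIZE are made up for this illustration and are not part of any library.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024) /* hypothetical size, 1 MiB */

/* Hypothetical entry point: with the wrapper, the child starts here,
 * not at the instruction after the clone() call. */
static int child_entry(void *arg)
{
    /* The return value becomes the child's exit status. */
    return *(int *)arg;
}

/* Runs child_entry in a new process created with the clone() wrapper.
 * Returns the child's exit status, or -1 on error. */
int run_child_with_clone(int exit_code)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Assumption: the stack grows down, so pass the top of the buffer.
     * No CLONE_VM, so the child gets its own copy of memory, like fork(). */
    pid_t pid = clone(child_entry, stack + CHILD_STACK_SIZE, SIGCHLD, &exit_code);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_with_clone(7) from the parent should report 7, because the wrapper passes the return value of child_entry to the exit system call in the child.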

Summary

The actual clone() syscall does not take a function argument; the child simply continues from the point where the syscall returns, just as with fork(). Both the clone() and fork() library functions are wrappers around the clone() syscall.

Documentation

My copy of the manual is somewhat more upfront about the fact that clone() is both a library function and a system call. However, I do find it somewhat misleading that clone() is found in section 2, rather than both section 2 and section 3. From the man page:

#include <sched.h>

int clone(int (*fn)(void *), void *child_stack,
          int flags, void *arg, ...
          /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

/* Prototype for the raw system call */

long clone(unsigned long flags, void *child_stack,
          void *ptid, void *ctid,
          struct pt_regs *regs);

And,

This page describes both the glibc clone() wrapper function and the underlying system call on which it is based. The main text describes the wrapper function; the differences for the raw system call are described toward the end of this page.

Finally,

The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted. Furthermore, the argument order changes.
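The statement that the raw system call continues "from the point of the call" can be checked with a small sketch that bypasses both library wrappers. It assumes x86-64 Linux, where the raw argument order is flags, stack, parent TID pointer, child TID pointer, TLS; the helper name fork_via_raw_clone is hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork-like behaviour from the raw clone system call.
 * Returns the child's exit status as seen by the parent, or -1 on error. */
int fork_via_raw_clone(int child_exit_code)
{
    /* Assumption: x86-64 argument order (flags, stack, ptid, ctid, tls).
     * A NULL stack means the child runs on a copy of the parent's stack. */
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: no start function was given, so execution simply
         * continued from the return of the system call. */
        _exit(child_exit_code);
    }

    /* Parent: pid holds the child's process ID. */
    int status;
    if (waitpid((pid_t)pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The control flow has the same shape as a fork() call: one call site, two returns, distinguished only by the return value. Other architectures may order the raw arguments differently, so this should not be treated as portable.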

pmod
Dietrich Epp
  • exactly: `fork()` doesn't call `clone()`, both are functions that use the syscall `clone`. – Javier Sep 20 '13 at 22:51

@Dietrich did a great job explaining by looking at the implementation. That's amazing! Anyway, there's another way of discovering that: by looking at the calls strace "sniffs".

We can prepare a very simple program that uses fork(2) and then check our hypothesis (i.e., that no fork syscall is actually made).

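#include <stdio.h>      /* perror */
#include <stdlib.h>     /* exit, EXIT_SUCCESS, EXIT_FAILURE */
#include <string.h>     /* strlen */
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* fork, write, STDOUT_FILENO, STDERR_FILENO */

#define WRITE(__fd, __msg) write(__fd, __msg, strlen(__msg))

int main(int argc, char *argv[])
{
  pid_t pid;

  switch (pid = fork()) {
    case -1:
      /* perror() appends ": <error text>" itself */
      perror("fork");
      exit(EXIT_FAILURE);
      break;
    case 0:
      /* child: fork() returned 0 */
      WRITE(STDOUT_FILENO, "Hi, i'm the child");
      exit(EXIT_SUCCESS);
    default:
      /* parent: fork() returned the child's pid */
      WRITE(STDERR_FILENO, "Heey, parent here!");
      exit(EXIT_SUCCESS);
  }

  return EXIT_SUCCESS;
}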

Now, compile that code ( clang -Wall -g fork.c -o fork.out ) and then execute it with strace:

strace -Cfo ./fork.strace.log ./fork.out

This intercepts the system calls made by our process (with -f we also intercept the child's calls) and writes them to ./fork.strace.log (the -o option); the -C option adds a summary at the end. The result on my machine (Ubuntu 14.04, x86_64 Linux 3.16) is (abridged):

6915  arch_prctl(ARCH_SET_FS, 0x7fa001a93740) = 0
6915  mprotect(0x7fa00188c000, 16384, PROT_READ) = 0
6915  mprotect(0x600000, 4096, PROT_READ) = 0
6915  mprotect(0x7fa001ab9000, 4096, PROT_READ) = 0
6915  munmap(0x7fa001a96000, 133089)    = 0
6915  clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa001a93a10) = 6916
6915  write(2, "Heey, parent here!", 18) = 18
6916  write(1, "Hi, i'm the child", 17 <unfinished ...>
6915  exit_group(0)                     = ?
6916  <... write resumed> )             = 17
6916  exit_group(0)                     = ?
6915  +++ exited with 0 +++
6916  +++ exited with 0 +++
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.58    0.000029           4         7           mmap
 17.80    0.000021           5         4           mprotect
 14.41    0.000017           9         2           write
 11.02    0.000013          13         1           munmap
 11.02    0.000013           4         3         3 access
 10.17    0.000012           6         2           open
  2.54    0.000003           2         2           fstat
  2.54    0.000003           3         1           brk
  1.69    0.000002           2         1           read
  1.69    0.000002           1         2           close
  0.85    0.000001           1         1           clone
  0.85    0.000001           1         1           execve
  0.85    0.000001           1         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000118                    28         3 total

As expected, there are no fork calls, just the raw clone syscall with its flags, child stack, etc. properly set.

Ciro Costa