Faster forking of large processes on Linux?

Question

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process?

My problem is that the forking process is ~500 MB in size, and a simple benchmark achieves only about 50 forks/s from it (cf. ~1600 forks/s from a minimally sized process), which is too slow for the intended application.

Some googling turns up vfork as having been invented as the solution to this problem... but also warnings not to use it. Modern Linux seems to have acquired the related clone and posix_spawn calls; are these likely to help? What's the modern replacement for vfork?

I'm using 64-bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).

That document is 7 years old. That's several Internet lifetimes. — Ignacio Vazquez-Abrams, Apr 29 '10 at 08:46

score 41 · Answer 1 · edited Aug 06 '20 at 15:37

On Linux, you can use posix_spawn(3) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.

See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(3), its advantages and some examples.

To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h>, then call posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK) on the attributes object you pass to posix_spawn.

I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.

benchmarking the various spawns over 1000 runs at 100M RSS
                            user     system      total        real
fspawn (fork/exec):     0.100000  15.460000  40.570000 ( 41.366389)
pspawn (posix_spawn):   0.010000   0.010000   0.540000 (  0.970577)

answered Mar 01 '11 at 11:12 by tmm1 · edited by DavidJ

Thanks; this does actually answer the question about a modern equivalent of vfork. In fact, I really didn't want the spawned process to share VM pages (I mean, if that was OK I'd have just created a thread) and using huge pages or an early-spawned helper turned out to be a better solution. – timday Mar 01 '11 at 14:33
The pages are shared only up until `execve(2)` is called, so the spawned process itself will not share any memory with the parent process. This basically avoids copying the page table which is what makes `fork(2)` slow in the first place, because the new fork is simply going to `execve(2)` and does not actually need a copy of the parent's memory. – tmm1 Mar 01 '11 at 20:56
Ah OK yes I get it; sorry, had to reread my original question and think back a bit to remember this is actually what I was looking for at the time. – timday Mar 02 '11 at 21:29
You don't need to specify `POSIX_SPAWN_USEVFORK`; glibc's `posix_spawn` has a heuristic that will automatically use `vfork` under the covers if it is safe to do so. (And you really *don't* want to use it if it isn't safe to do so.) – Glyph Jun 14 '12 at 06:34

score 17 · Accepted Answer · answered May 20 '10 at 13:15

Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this, on using huge page support to improve fork performance.

Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.

P Shved · Answer 3 · 2010-04-29T08:41:06.250

Did you actually measure how much time forks take? Quoting the page you linked,

Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)

So the number of forks doesn't really show how big the overhead will be. You should measure the time consumed by forks and (this is generic advice) only by the forks you actually perform, rather than benchmarking maximum performance.

But if you do find that forking a large process is slow, you can spawn a small ancillary process, pipe commands from the master process to its input, and have the small process fork and exec those commands on the master's behalf.

posix_spawn()

This function, as far as I understand, is implemented via fork/exec on desktop systems. However, on embedded systems (particularly those without an MMU on board), processes are spawned via a syscall whose interface is posix_spawn or a similar function. Quoting the informative section of the POSIX standard describing posix_spawn:

Swapping is generally too slow for a realtime environment.

Dynamic address translation is not available everywhere that POSIX might be useful.

Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.

Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.

I don't think you will benefit from this function on a desktop system if your goal is to minimize time consumption.

Yes I was under the impression "Linux never had this problem" too... until I actually benchmarked it and got the numbers I quote above. Presumably copying those tables (which I believe are VM page tables) takes quite a while when your process is 500MByte big. — timday, Apr 28 '10 at 18:30
+1, I implemented this as an 'exec' helper to a single threaded non-blocking server. Instead of blocking in a fork() / execv(), I simply pipe the request to the helper, then flag the connection as waiting_for_exec_result, then do useful work while waiting for the data to be available to send back to the client. — Tim Post, Apr 29 '10 at 08:27

score 5 · Answer 4 · answered Apr 28 '10 at 18:02

If you know the number of subprocesses ahead of time, it might be reasonable to pre-fork your application on startup and then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program, it might be reasonable to fork a subprocess or two ahead of time for quick turnaround later. Neither of these options would directly solve the problem, but if either approach is suitable for your app, it might allow you to side-step the issue.

Sam Liddicott · Answer 5 · 2016-02-09T15:58:28.443

I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/

Excerpt:

The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:

The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result of this, any change to any non-stack variable made by the child is visible by the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.

The child starts from an entry function which is being called right after the child was created. This is like threads, and unlike fork().

The child has a separate stack space which is similar to threads and fork(), but entirely different to vfork().

The most important: This thread-like child process can call exec().

In a nutshell, by calling clone in the following way, we create a child process which is very similar to a thread but still can call exec():

pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);

However I think it may still be subject to the setuid problem:

http://ewontfix.com/7/ "setuid and vfork"

Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Consider for example a multi-threaded server daemon, running initially as root, that’s using posix_spawn, implemented naively with vfork, to run an external command. It doesn’t care if this command runs as root or with low privileges, since it’s a fixed command line with fixed environment and can’t do anything harmful. (As a stupid example, let’s say it’s running date as an external command because the programmer couldn’t figure out how to use strftime.)

Since it doesn’t care, it calls setuid in another thread without any synchronization against running the external program, with the intent to drop down to a normal user and execute user-provided code (perhaps a script or dlopen-obtained module) as that user. Unfortunately, it just gave that user permission to mmap new code over top of the running posix_spawn code, or to change the strings posix_spawn is passing to exec in the child. Whoops.

Is using `CLONE_VM` without `CLONE_VFORK` safe to use? It seems to me it is hard to get this right. How do you prevent messing up the memory space of the parent process in the child process? Maybe only making direct system calls works, but calling into glibc seems problematic. On my machine, glibc's `clone()` still is much much faster than `fork()` even without `CLONE_VM`. — Ton van den Heuvel, Jul 22 '20 at 09:35
