5

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.

The app is written in Rust and I'm calling `Command::new("path/to/command")` to create the child process.
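Simplified, the spawning code looks roughly like this (a sketch only; the command path and the data source are placeholders for the real ones):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

fn stream_to_child(data: &[u8]) -> std::io::Result<()> {
    // Spawn the child with a piped stdin so the parent can stream data into it.
    let mut child = Command::new("path/to/command")
        .stdin(Stdio::piped())
        .spawn()?; // this is the point where the out-of-memory error occurs

    // Stream the data from the parent into the child's stdin.
    child
        .stdin
        .as_mut()
        .expect("stdin was configured as piped")
        .write_all(data)?;

    child.wait()?;
    Ok(())
}
```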

When I spawn the child process, the operating system returns an out-of-memory error.

strace output:

[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)

Why does this error occur? The child process should not consume more than 1 GB, and exec() is called immediately after clone().

Douglas Daseeco
Nextar
  • 1
    Share code please, and why not use fork? – Tony Tannous Mar 08 '17 at 15:32
  • 1
    This might be an overcommit issue. Try executing `echo "1" >/proc/sys/vm/overcommit_memory` as the root user. – user4815162342 Mar 08 '17 at 15:43
  • 1
    You could always spawn the child early on in the process's lifetime and keep it around until you need it. – Shepmaster Mar 08 '17 at 15:53
  • 3
    You should probably provide details such as your version of Rust, what OS and OS version you are using, etc. – Shepmaster Mar 08 '17 at 15:55
  • @user4815162342 Can you add a detailed explanation of what memory overcommitment is and what it does? – Nextar Mar 08 '17 at 16:08
  • Actually, I've seen "overcommit_memory" mentioned in other questions, but it seems kind of problematic to change memory configuration without knowing what I'm actually changing – Nextar Mar 08 '17 at 16:11
  • 1
    @Nextar I will post an answer to that effect if that actually makes a difference in your case. Have you tried it to see if it helps? (Also, you can [google it](https://www.google.com/search?q=linux+overcommit+memory).) – user4815162342 Mar 08 '17 at 16:12
  • "It's mandatory to keep the data in the RAM." - So did you call `mlock()` or even `mlockall()`? – osgx Mar 08 '17 at 16:33
  • @osgx No, I didn't. But as the docs point out, "The function mlockall() causes all of the pages mapped by the address space of a process to be memory resident until unlocked or until the process exits or EXECS another process image." And after the clone, Command::new should exec another process image. – Nextar Mar 08 '17 at 16:59
  • Is there some way to call posix_spawn() from Rust? – ninjalj Mar 08 '17 at 18:00
  • 1
    Nextar, what was the value in the `/proc/sys/vm/overcommit_memory` file, and in the similarly named "overcommit" files in the same dir, when you got the error? – osgx Mar 08 '17 at 18:00
  • 1
    Could you provide 'free -m' output just before you launch your process, 'free -m' just before you run the command, and the output of 'cat /proc/sys/vm/overcommit_memory' and 'cat /proc/sys/vm/overcommit_ratio'? – Oleg Kuralenko Mar 11 '17 at 14:14
  • One other thing to check is if you're running inside a memory cgroup or something else limiting resources. – ephemient Mar 13 '17 at 05:50
  • 1
    Setting overcommit_memory to 1 fixes the issue and makes perfect sense to me. It would be great if someone could post a detailed answer (maybe with some docs related to the overcommit_memory setting) for other people who have the same issue in the future. :) – Nextar Mar 13 '17 at 09:21
  • 2
    @Shepmaster, I like your suggestion of spawning the child before the 20G allocation. The child could sit in a wait state until it is needed. Another step forward might be to put all the processing in children. Every time I try to do processing in the parent, I have to change to a controller-only parent later. Now I just start that way. – Douglas Daseeco Mar 13 '17 at 16:59
  • 1
    Regarding the overcommit_memory = ALWAYS, we may use that kind of hack to get through a big data task and meet a deadline, but we'd add going back to fix the root cause to our Agile backlog with a high priority so it gets done right before we forget what we did and get some strange bug that takes forever to correlate back to the unconditional overcommit. – Douglas Daseeco Mar 13 '17 at 17:02
  • 1
    @FauChristian yep, I've worked at a place where we had to deal with spawning arbitrary child processes *and* we used multithreading. We quickly created a system where we spawned a helper before anything. That helper did basically nothing but spawn further children, all communicating through pipes. – Shepmaster Mar 13 '17 at 17:20
  • @Shepmaster, exactly. The executable starts a child which then detaches from the parent with close and wait calls so that the parent executable exits normally and no zombie is created. The child then creates pipes and dups them before forking and execvp-ing grandchildren, used where process independence benefits reliability or throughput, whereas pthread_create can be used for convenience where that's not as much of a concern. The pattern is so consistently successful in GNU projects and our laboratory work that I've considered creating a C++ template Daemon. – Douglas Daseeco Mar 13 '17 at 17:48

1 Answer

5

The Problem

When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.

  1. The streams are duplicated (with dup2 or a similar call)
  2. The parent process is forked (with the fork or clone system call)
  3. The forked process executes the child (with a call from the exec family)

The parent and child are now concurrent processes. The Rust call you are currently using appears to use clone in a way that behaves much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call returns -1 and errno is set to ENOMEM.
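Sketched directly against the C interfaces (a sketch using the libc crate; the program path is a placeholder and error handling is omitted), the sequence looks roughly like this:

```rust
use std::ffi::CString;

// Rough sketch of the fork/exec sequence performed when spawning a child.
fn spawn_sketch() {
    unsafe {
        let pid = libc::fork(); // step 2: duplicate the parent (copy-on-write)
        if pid == 0 {
            // step 1 happens here in the child: dup2() the pipe ends onto
            // stdin/stdout before the exec.
            let prog = CString::new("path/to/command").unwrap();
            let argv = [prog.as_ptr(), std::ptr::null()];
            // step 3: replace the child's image with the target program
            libc::execvp(prog.as_ptr(), argv.as_ptr());
            libc::_exit(127); // only reached if exec failed
        }
        // pid > 0: the parent continues; pid < 0: fork/clone failed, e.g. ENOMEM
    }
}
```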

If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.

Recommendation

Design the parent process to be lean. Then spawn two worker children, one that handles your 20 GB need and the other that handles your 1 GB need [1]. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
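A rough sketch of that shape, with placeholder worker paths and a trivial line written to a pipe standing in for whatever the real workers would speak:

```rust
use std::io::Write;
use std::process::{Child, Command, Stdio};

// Spawn a worker while the parent is still lean, so the fork/clone only has
// to account for a small address space.
fn spawn_worker(path: &str) -> std::io::Result<Child> {
    Command::new(path).stdin(Stdio::piped()).spawn()
}

fn main() -> std::io::Result<()> {
    // Both workers are spawned before the parent allocates anything large.
    let mut big_worker = spawn_worker("path/to/20g-worker")?;
    let mut small_worker = spawn_worker("path/to/1g-worker")?;

    // The large data lives inside the 20 GB worker (or is streamed to it in
    // chunks); the lean parent only coordinates.
    big_worker
        .stdin
        .as_mut()
        .expect("piped stdin")
        .write_all(b"work item\n")?;

    big_worker.wait()?;
    small_worker.wait()?;
    Ok(())
}
```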

Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.

The 32G would then likely suffice for the 20G and 1G processing needs, along with the OS and the lean parent process.

Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.

Memory Overcommit Always

Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question because the Rust call ultimately invokes the Linux clone call, which honors that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that the value of 1 is dangerous, especially in production environments.
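If you do experiment with that setting, the current mode is easy to inspect at run time (a minimal sketch reading the standard Linux procfs location):

```rust
use std::fs;

fn main() {
    // 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = strict accounting
    match fs::read_to_string("/proc/sys/vm/overcommit_memory") {
        Ok(mode) => println!("vm.overcommit_memory = {}", mode.trim()),
        Err(e) => eprintln!("could not read overcommit mode: {}", e),
    }
}
```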

Background

Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit forking that is less heavyweight than full process duplication, which mirrors, from the other direction, the provision of more independence between pthreads. Some of these discussions have produced extensions to traditional process spawning that have entered POSIX standardization.

In the early 2000s, Linus Torvalds suggested a flag structure to determine which components of the execution model are shared and which are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.

Over-committing memory is not discussed much, if at all, in those threads. The design goal was MORE control over the results of a fork rather than delegation of memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.

Caveats

Memory overcommit goes beyond these extensions, adding the complexity of trade-offs between its modes [2], design trend caveats [3], practical run time limitations [4], and performance impacts [5].

Portability and Longevity

Additionally, without standardization, code relying on memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility, or even of a deprecation warning, if the setting system changes.

Danger

The linuxdevcenter documentation [2] says, "1 always overcommits. Perhaps you now realize the danger of this mode," and there are other indications of danger with ALWAYS overcommitting [6], [7].

The implementers of overcommit on Linux, Windows, and VMware may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to unstable characteristics under certain conditions. Even the name overcommit tells us something about its true character as a practice.

A non-default overcommit_memory mode, for which several warnings are issued, may work for the immediate trial of the immediate case but later lead to intermittent reliability problems.

Predictability and Its Impact on System Reliability and Response Time Consistency

The idea of a process in a UNIX-like operating system, from its Bell Labs beginnings, is that a process makes concrete requests to its container, the operating system. The result is both predictable and binary: the request is either denied or granted. Once granted, the process is given complete control of and direct access to the resources until it relinquishes them.

The swap space aspect of virtual memory is a breach of this principle that appears as gross deceleration of activity on workstations when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.

Conclusion

There are many ways to get the most out of physical memory, but doing so by hoping that allocated memory will be used only sparsely is likely to introduce negative impacts. Performance hits from swapping when overcommit is overused are the well-documented example. If you are keeping 20G of data in RAM, this may particularly be the case.

Allocating only what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability or creating spikes in swap disk usage, and can operate without caveat up to the limits of system resources.

The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.

Notes and References

[1] Spawning worker children may require some code refactoring and may appear to be too much trouble at a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.

[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2

[3] https://www.etalabs.net/overcommit.html

[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/

[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server

[6] https://github.com/kubernetes/kubernetes/issues/14452

[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Douglas Daseeco
  • Is there a COW mechanism in step 2? What do you mean by "memory is doubled"? Which memory, physical or virtual? What is the setting of overcommit_memory on the PC with the problem? – osgx Mar 13 '17 at 02:28
  • "When virtual memory is allocated, it must correspond to physical memory space" - not it don't. There is overcommit in Linux, enabled by default: 9.6 Overcommit and OOM of https://www.win.tue.nl/~aeb/linux/lk/lk-9.html & http://stackoverflow.com/questions/38688824/. – osgx Mar 13 '17 at 03:15
  • 1
    No. Linux usually allows allocating more virtual memory than the physical memory size (RAM + swap). This memory is not mapped to anything real; on access to each page there will be a "page fault" interrupt, and the kernel will allocate a real physical page, install the correct mapping, and restart the failing instruction. When there is no physical memory left to allocate from, there will be OOM. This can be turned off via overcommit_memory (and there are possibly more complex interactions with THP). Please add references to your answer, read some docs, and try not to post when you do not understand the problem fully. – osgx Mar 13 '17 at 03:54
  • And on fork, parent memory will be COWed: made shared between parent and child but re-marked as read-only. On the first write access to each touched page (from either process) there will again be a page fault, with allocation of physical memory for the copy and unsharing of the page (COW = copy-on-write). So fork will consume memory for PTEs and VMA tables (to create the new virtual mapping for the child) and will do some overcommit heuristic accounting, but it will not allocate 20 GB of physical memory at fork. – osgx Mar 13 '17 at 03:58
  • 2
    The memory is only copied when touched, so it is COW. But the VM accounting has to allow for the fact that the user might touch all of it. The overcommit logic in the kernel defaults to setting 0, not 1. Setting 0 is heuristic overcommit, which means the kernel allows *some* overcommit but will reject really excessive amounts of it. – Zan Lynx Mar 13 '17 at 05:24
  • Actually, I'm looking more for the answer @osgx provided. With COW the 20 GB shouldn't be needed. But I didn't know about overcommit_memory. So with overcommit_memory not set to 1, it makes perfect sense to me why the clone() is not working – Nextar Mar 13 '17 at 09:18
  • @Nextar, just do `cat /proc/sys/vm/overcommit_memory` to see what it is when the failing program in question fails. Please post the output of the `free` command too. – osgx Mar 13 '17 at 14:28
  • @DouglasDaseeco The process is the only user process running on the system; actually, there are multiple projects, and for each a new EC2 instance is created. I also like the idea of creating the process at the beginning of the main process and letting it wait :) – Nextar Mar 14 '17 at 08:40