2

I have this code:

#!/bin/bash
pids=()
for i in $(seq 1 999); do
  sleep 1 &
  pids+=( "$!" )
done
for pid in "${pids[@]}"; do
  wait "$pid"
done

I expect the following behavior:

  • spin through the first loop
  • wait about a second on the first pid
  • spin through the second loop

Instead, I get this error:

./foo.sh: line 8: wait: pid 24752 is not a child of this shell

(repeated 171 times with different pids)

If I run the script with a shorter loop (50 iterations instead of 999), I get no errors.

What's going on?

Edit: I am using GNU bash 4.4.23 on Windows.

Dragon
  • 173
  • 10
  • I don't know why it's not working, but for what it's worth, `wait` with no arguments will wait for all child processes, and `wait -n` will wait for any child. – John Kugelman Feb 06 '23 at 05:46
  • 2
    Can't be sure, but sounds like forking a thousand processes is taking more than a second, so the early ones are already done when you hit the corresponding wait. You could check this with a longer sleep. Since 50 works with 1 second, 20 seconds ought to work for a thousand. – Gene Feb 06 '23 at 05:58
  • 1
    It shouldn't matter how long the children take. – John Kugelman Feb 06 '23 at 06:10
  • Added version info. Switching to a 20s sleep creates a new error: `./foo.sh: fork: retry: Resource temporarily unavailable`, probably due to hitting a system limit. I thought `wait` produced the return code of even completed jobs. If I do `echo foo &` and `wait $!` then it works even with ~60s between commands. So the fact that `sleep` has already completed shouldn't affect anything, right? – Dragon Feb 06 '23 at 06:12
  • 1
    what do you exactly mean when you say "GNU bash 4.4.23 on Windows"? cygwin? windows subsystem for linux? or ...? – pynexj Feb 06 '23 at 08:47
  • This looks like the same problem: ["pid X is not a child of this shell" error reported by wait after spawning more than 545 tasks when executing script in docker](https://stackoverflow.com/q/61637874/4154375). It doesn't appear to have been resolved. It looks like running large numbers of background processes causes Bash to malfunction in some circumstances. – pjh Feb 06 '23 at 17:41
  • 1
    I see exactly the same problem when I run the code on Cygwin with Bash 4.4.12. – pjh Feb 06 '23 at 17:48
  • 1
    The code in the question works with Bash 5.1.16 on my Ubuntu 22.04 VM, but it fails with the `pid XXX is not a child of this shell` error if I increase the number of background processes from 999 to 5000. – pjh Feb 06 '23 at 18:00
  • Is this an interactive or noninteractive shell? (Does it have job control enabled?) -- you'll get more predictable behavior if this is in a script, not run in an interactive shell. – Charles Duffy Feb 06 '23 at 20:25
  • I can reproduce with `docker run -ti --rm bash:4` on archlinux and big enough number `5000`. I can reproduce with `bash:5.0.18` but can't with `bash:5.1.0`. From https://github.com/bminor/bash/blob/bash-5.1/CHANGES : `Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler.` maybe – KamilCuk Feb 06 '23 at 21:42
  • I can more consistently reproduce with `true &`. `docker run -ti --rm -v $PWD:/mnt:ro -w /mnt bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids'` is the shortest I come up with. – KamilCuk Feb 06 '23 at 22:01
  • I am running this from a script (`foo.sh`). Job control should be disabled, but I tried with both `set -m` and `set +m` and get the same error(s) both ways. As for how I'm running bash: I don't have WSL or Cygwin; this version of bash came with Git when I installed it (iirc). – Dragon Feb 07 '23 at 01:35

3 Answers

3

The way you might reasonably expect this to work, as it would if you wrote a similar program in most other languages, is:

  1. sleep is executed in the background via a fork+exec.
  2. At some point, sleep exits leaving behind a zombie.
  3. That zombie remains in place, holding its PID, until its parent calls wait to retrieve its exit code.

However, shells such as bash actually do this a little differently. They proactively reap their zombie children and store their exit codes in memory so that they can deallocate the system resources those processes were using. Then when you `wait`, the shell just hands you whatever value is stored in memory; the zombie itself could be long gone by then.

Now, because all of these exit statuses are being stored in memory, there is a practical limit to how many background processes can exit, without you calling `wait`, before you've filled up all the memory the shell has available for this. I expect that you're hitting this limit somewhere in the several hundreds of processes in your environment, while other users manage to make it into the several thousands in theirs. Either way the outcome is the same: eventually there is nowhere left to store information about your children, and that information is lost.
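If you don't need each child's individual exit status, this per-PID bookkeeping can be sidestepped entirely, as John Kugelman's comment suggests: a bare `wait` blocks until every child has exited and never has to look up a stored status for a particular PID. A minimal sketch:

```shell
#!/bin/bash
# Sketch: instead of saving PIDs and waiting on each one, let a
# bare `wait` block until every background job has exited. This
# avoids per-PID status lookups entirely (but discards the
# individual exit codes).
for i in $(seq 1 100); do
  sleep 0.1 &
done
wait   # waits for all remaining children at once
echo "all children reaped"
```

The trade-off is that you can no longer tell which child failed; if you need that, `wait -n` (shown in a later answer's comments) reports one finished child at a time.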

tjm3772
  • 2,346
  • 2
  • 10
  • "Proactively reap their zombie children" -- true, _when job control is enabled_, which it isn't for scripts. (If the OP is testing this in their interactive shell, that would explain the problem). – Charles Duffy Feb 06 '23 at 20:26
  • I don't think job control is the deciding factor. I can write a C program that forks+execs into a `/bin/sleep 30` and watch its status go to Z in the process table until the program eventually ends, but with an equivalent bash script the sleep simply disappears from the process table even with the only set options being `hB`. (Both cases use a `sleep 60` to keep the program alive after the initial sleep is backgrounded, to observe the behavior of the children.) – tjm3772 Feb 06 '23 at 21:15
  • My experience also is that non-interactive shells, without job control enabled, proactively reap their zombie children. In fact, I've found it to be tricky to get a bash program to create zombie processes. – pjh Feb 06 '23 at 23:03
  • Similar topics come up in [How can I reproduce zombie process with bash as PID1 in docker?](https://stackoverflow.com/questions/37022611/how-can-i-reproduce-zombie-process-with-bash-as-pid1-in-docker). – pjh Feb 06 '23 at 23:15
3

POSIX says:

The implementation need not retain more than the {CHILD_MAX} most recent entries in its list of known process IDs in the current shell execution environment.

{CHILD_MAX} here refers to the maximum number of simultaneous processes allowed per user. You can get the value of this limit using the getconf utility:

$ getconf CHILD_MAX
13195

Bash stores the statuses of at most twice that many exited background processes in a circular buffer, and reports `not a child of this shell` when you call `wait` on the PID of an old one that has been overwritten. You can see how it's implemented here.
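The limit varies from system to system, so it's worth querying it where your script actually runs. A minimal sketch (the printed value will differ per machine; 13195 above is just one example):

```shell
#!/bin/bash
# Query the per-user process limit POSIX refers to; the value is
# system-dependent.
limit=$(getconf CHILD_MAX)
echo "CHILD_MAX is $limit"
# Per the answer above, Bash retains roughly 2*CHILD_MAX exited-job
# statuses in its circular buffer, so spawning more background jobs
# than that before calling `wait` risks overwriting older entries.
echo "approximate status-buffer capacity: $((limit * 2))"
```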

oguz ismail
  • 1
  • 16
  • 47
  • 69
  • 1
    This answer seems the most likely cause. Interestingly, the number of errors I get isn't consistent. My CHILD_MAX is 256, but I can run up to ~700 bg processes without errors. Possibly this is due to the hash table where results are stored. Thank you! – Dragon Feb 08 '23 at 03:14
1

I can reproduce on ArchLinux with `docker run -ti --rm bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids'` and any earlier version. I can't reproduce with bash:5.1.0.

What's going on?

It looks like a bug in your version of Bash. There were a couple of improvements in jobs.c and wait.def in Bash 5.1, and "Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler" is mentioned in the changelog. This suggests an issue with handling a SIGCHLD signal while already handling another SIGCHLD signal.

KamilCuk
  • 120,984
  • 8
  • 59
  • 111
  • 1
    I agree that this is a Bash bug (upvoted), but I don't think that it is (fully) fixed even in Bash 5.1. I can reproduce the error with 5000 background processes using Bash 5.1.16 on a Ubuntu 22.04 VM. That (to address Charles Duffy's reasonable concerns) was with a non-interactive shell (`bash testprog`). – pjh Feb 06 '23 at 22:57