
On a 54-core machine, I use `os/exec` to spawn hundreds of client processes and manage them with an abundance of goroutines.

Sometimes, but not always, I get this:

runtime: failed to create new OS thread (have 1306 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

My ulimit is pretty high already:

$ ulimit -u
1828079

There's never a problem if I limit myself to, say, 54 clients.

Is there a way I can handle this situation more gracefully? E.g. not bomb out with a fatal error, but do less or delayed work instead? Or query the system ahead of time and anticipate the maximum amount of work I can do (I don't want to limit myself to just the number of cores, though)?

Given my large ulimit, should this error even be happening? Running `grep -c goroutine` on the stack dump following the fatal error gives only 6087. Each client process (of which there are certainly fewer than 2000) might have a few goroutines of its own, but nothing crazy.

Edit: the problem only occurs on high-core machines (~60 cores). Keeping everything else constant and just reducing the core count to 30 (this is an OpenStack environment, so the same underlying hardware is still used), these runtime errors don't occur.
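Edit 2: for reference, here is a minimal sketch of the kind of throttling I'm considering, using a buffered channel as a counting semaphore around the spawning goroutines. The limit of 64 and the use of `true` as the client command are arbitrary placeholders, not my real setup:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// maxInFlight caps how many client processes run concurrently.
// The value is a placeholder; picking a good one is part of my question.
const maxInFlight = 64

// runClients spawns one process per command, never exceeding
// maxInFlight at a time, and returns the first error encountered.
func runClients(commands [][]string) error {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	errs := make(chan error, len(commands))

	for _, args := range commands {
		sem <- struct{}{} // block until a slot frees up
		wg.Add(1)
		go func(args []string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := exec.Command(args[0], args[1:]...).Run(); err != nil {
				errs <- fmt.Errorf("%v: %w", args, err)
			}
		}(args)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // report the first failure, if any
	}
	return nil
}

func main() {
	cmds := [][]string{{"true"}, {"true"}, {"true"}}
	if err := runClients(cmds); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("all clients finished")
}
```

This bounds the number of in-flight processes, but I'd still like to know how to choose the bound from the system's actual limits rather than hardcoding it.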

sbs
    Interesting situation. BTW, counting goroutines isn't going to help solve your problem, as goroutines are not directly related to threads. You could have (theoretically) millions of goroutines, and only a single OS thread. – Jonathan Hall Aug 19 '19 at 13:32
  • Have you tried modifying the `GOMAXPROCS` environment variable? – guzmonne Aug 19 '19 at 15:17
  • See also https://stackoverflow.com/q/46484627/1256452 – torek Aug 19 '19 at 15:28
  • Update (replaces an earlier comment): having had a look at the runtime source (see previous link and its links), you're definitely getting EAGAIN errors. What's not clear to me is why you are getting another thread when you have 1306 already, nor why your system is running out at 1306—presumably you're hitting a system-wide limit. See the fork manpage and `cat /proc/sys/kernel/pid_max` and `/proc/sys/kernel/threads-max`; you might be hitting a lower limit from a cgroup `pids.max`, too. – torek Aug 19 '19 at 15:39
  • Hm, if your client processes manage goroutines themselves, why not refactor the client processes to goroutines? – Markus W Mahlberg Aug 20 '19 at 05:15
  • @MarkusWMahlberg Client processes can also run on different machines, so can't refactor them. – sbs Aug 20 '19 at 09:18
  • @torek pid_max is 55296, threads-max is 3656158, and /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max is 18247. – sbs Aug 21 '19 at 08:35
  • @guzmonne setting GOMAXPROCS=1000 (under the number of threads it tries to open when it fails) seems to avoid the runtime error, but the app starts misbehaving and doing strange things probably related to not responding to some clients in time. – sbs Aug 21 '19 at 08:47
  • @torek Different attempts fail after having different numbers of threads already, eg. another failed with `runtime: failed to create new OS thread (have 1291 already; errno=11)`. The issue might actually be hardware related, since trying the same with a different CPU (OS and software identical; this is in OpenStack) works fine. – sbs Aug 21 '19 at 08:49
  • And, do you have to create new threads to handle each client? Can't you use `goroutines`? – guzmonne Aug 21 '19 at 15:09
  • I'm not all that familiar with the internals of the go runtime, and I wonder if it automatically spins off new threads (despite GOMAXPROCS defaulting to the number of cpus) when making various blocking system calls. – torek Aug 21 '19 at 16:10
  • @guzmonne I'm not creating threads myself; the go runtime is automatically doing that in response to me creating goroutines. – sbs Aug 23 '19 at 20:26
  • @torek GOMAXPROCS doesn't have a default I think. It does seem to be a problem related to running out of a resource in the Operating System. Testing with a different OS, I got the server staying alive, but a manually executed client failing after only having 3 threads already. I guess the go runtime just doesn't handle this situation gracefully? – sbs Aug 23 '19 at 20:29
  • @sbs: `runtime/debug.go` can change the current (internal variable) `gomaxprocs` via [`GOMAXPROCS()`](https://golang.org/pkg/runtime/#GOMAXPROCS), but the default *is* the number of CPUs. But `runtime/extern.go` says: "There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit." So, yes, probably you have a lot of goroutines that are in blocking system calls. As a workaround you might have these goroutines use a counting semaphore to limit the number of syscall waiters. – torek Aug 23 '19 at 20:52
  • The trick will be figuring out what, in your code or libraries you use, is making blocking system calls. The Linux `strace` utility might be useful here. – torek Aug 23 '19 at 20:58
  • @sbs as explained on https://pkg.go.dev/runtime/debug#SetMaxThreads: "A Go program creates a new thread only when a goroutine is ready to run but all the existing threads are blocked in system calls, cgo calls, or are locked to other goroutines due to use of runtime.LockOSThread", and as pointed out above GOMAXPROCS (that defaults to the number of cores on your machine) does not apply in those cases. Your program is probably starting an unbounded number of goroutines that then block in a syscall somewhere, e.g. by reading from disk, or in other blocking operations. – CAFxX Jul 18 '21 at 09:28
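To make the `runtime/debug.SetMaxThreads` suggestion from the comments concrete, here is a minimal sketch. The limit of 20000 is an arbitrary placeholder; note that exceeding the configured ceiling crashes the program much like the EAGAIN from the OS does, so this is a knob for raising headroom, not a fix for unbounded numbers of goroutines blocked in syscalls:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// raiseThreadLimit sets the Go runtime's OS-thread ceiling and returns
// the previous value (the runtime's default is 10000). Hitting the
// ceiling is itself a fatal error, so treat this as a last resort.
func raiseThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}

func main() {
	prev := raiseThreadLimit(20000) // 20000 is a placeholder value
	fmt.Println("previous max threads:", prev)
}
```

The underlying cause (goroutines blocked in syscalls each pinning a thread, as the comments explain) is better addressed by bounding concurrency, e.g. with a counting semaphore around the blocking work.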

0 Answers