
I run parallel like this, abstracting out some of the details:

generate_job_list | parallel -j10 -q bash -c 'echo -n "running {}" ; dostuff {}'

I've noticed that sometimes the child processes that parallel spawns die after receiving SIGKILL (I know because dostuff is a psql command that runs a vacuum, and the Postgres logs tell me the command received SIGKILL). I don't have a timeout set, so it's not clear to me what could possibly be doing something like that. It happens after the child process has been running for hours.

Does parallel have a default timeout (the docs don't seem to suggest it does)? Or does anyone have other ideas on what could be causing this?

ETA: Adding some details that helped me track this down to the body of the question, since it might help others who are hitting the same problem find this question.

In your Postgres logs you should find some messages like this:

LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
FATAL:  the database system is shutting down

that will have been generated despite you not asking Postgres to shut down.
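
If you suspect the OOM killer (as turned out to be the case here; see the comments), a quick way to confirm is to search the kernel log. This is only a minimal sketch, assuming a Linux host; the exact log file paths vary by distro:

# look for OOM-killer activity in the kernel ring buffer
dmesg -T | grep -i -E 'oom|killed process'

# or in the kernel messages persisted by syslog (paths vary by distro)
grep -i 'oom-killer' /var/log/syslog /var/log/messages 2>/dev/null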

Gordon Seidoh Worley
  • If the SIGKILL was reported in the Postgres log, then it's not the `parallel` process which died, it's the server process which was handling the connection. Sounds to me like it was taken out by the OS's out-of-memory (OOM) killer. – Nick Barnes Jan 24 '19 at 22:04
  • possibly, but this is on a system with 512 GB of RAM, although according to monitoring at time of death the system had "only" 330 GB unused by processes – Gordon Seidoh Worley Jan 24 '19 at 22:09
  • Well, I can't think of much else that would randomly SIGKILL a server backend... If it was the OOM killer, it should have [logged something](https://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer) – Nick Barnes Jan 24 '19 at 22:18
  • yep, looks like you're right. here's the smoking gun from syslog: `postgres invoked oom-killer: gfp_mask=0x26080c0, order=2, oom_score_adj=0` – Gordon Seidoh Worley Jan 24 '19 at 22:20
  • Ah, looks like the answer is probably to deal better with memory overcommit: https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT – Gordon Seidoh Worley Jan 24 '19 at 22:25
  • What's your [`maintenance_work_mem`](https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM) set to? It might just be configured a bit too aggressively to cope with ten vacuums in parallel. – Nick Barnes Jan 25 '19 at 12:26
  • 2 GB (I know this seems high, but some of these tables are very large, 100 million+ rows, and the larger setting has seemed to help the vacuums complete faster) – Gordon Seidoh Worley Jan 25 '19 at 17:52

1 Answer


So, as mentioned in the comments, the problem was the OOM killer. I fixed it by doing a few things:

  • partition tables that were effectively too big to vacuum without hitting memory issues
  • change the memory overcommit mode to 2 and set the overcommit ratio to 95 (see the sketch after this list)
  • make autovacuum more aggressive so I don't have to run as many manual maintenance tasks; this is also better because autovacuum doesn't run in a regular transaction, so if it fails it doesn't force a long recovery
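
For the overcommit piece, the change boils down to kernel settings along these lines (a minimal sketch, assuming a Linux host; the table name and autovacuum values in the last command are just placeholders to adjust for your own workload):

# strict overcommit accounting: commit limit = swap + 95% of RAM
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=95

# persist the settings across reboots
printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 95\n' >> /etc/sysctl.d/90-overcommit.conf

# per-table autovacuum tuning for the biggest tables (placeholder name and values)
psql -c "ALTER TABLE my_big_table SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 0);"
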
Gordon Seidoh Worley