Nginx+PHP-FPM occasionally returning 502

Question

This has been asked many times, but none of the answers helped. After hours of digging, I am turning here for help. I am a developer with limited sysadmin experience, but because our ops person left, I am left to try and keep things alive.

On one of our sites we recently started getting random 502 errors. This happens fairly regularly, at least a dozen times a day (as reported by Nagios and sometimes by our users). I am not aware of any configuration changes. The web stack is standard: an nginx server proxying requests to PHP-FPM, which runs a WordPress-based app.

The nginx error log contains a lot of messages like this:

[error] 31180#31180: *451395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: x.x.x.x, server: x.x, request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"

Most of them come from a client IP that is the IP of the server itself (not sure why, maybe some monitoring?), but there are errors from random public IPs as well.
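To check how the errors are distributed across clients, the `client:` field of the matching error-log lines can be tallied. This is a minimal Python sketch that assumes the log format shown above; the function name `count_reset_clients` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Matches the "client: <address>," field of an nginx error-log line.
CLIENT_RE = re.compile(r"client: ([^,\s]+),")

def count_reset_clients(lines):
    """Count 'Connection reset by peer' upstream errors per client address."""
    counts = Counter()
    for line in lines:
        # Only look at the upstream reset errors discussed here.
        if "Connection reset by peer" not in line:
            continue
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open error-log file object to `count_reset_clients` and calling `.most_common(10)` on the result would list the ten addresses that trigger the most resets, which should show whether the server's own IP really dominates.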

The PHP-FPM log gives warnings like this approximately every hour:

WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 71 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 75 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 79 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 83 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 87 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 91 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 95 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 99 total children
WARNING: [pool www] server reached pm.max_children setting (100), consider raising it
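The warnings can be summarized programmatically to see how the pool grows before it hits the limit. The following Python sketch assumes the message wording shown above; the function name `summarize_busy_burst` is hypothetical.

```python
import re

# Patterns for the two PHP-FPM warning messages quoted above.
SPAWN_RE = re.compile(
    r"spawning (\d+) children, there are (\d+) idle, and (\d+) total children"
)
LIMIT_RE = re.compile(r"server reached pm\.max_children setting \((\d+)\)")

def summarize_busy_burst(lines):
    """Return (list of total-children values, max_children limit hit or None)."""
    totals = []
    limit = None
    for line in lines:
        spawn = SPAWN_RE.search(line)
        if spawn:
            totals.append(int(spawn.group(3)))
            continue
        hit = LIMIT_RE.search(line)
        if hit:
            limit = int(hit.group(1))
    return totals, limit
```

For the excerpt above, this yields totals running from 71 to 99 in steps of 4 and a limit of 100, with 0 idle children throughout. Every worker is busy for the whole burst, which suggests requests are not finishing, not merely that too few workers were started.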

Things I have tried

Rebooting

Obvious, but did not help at all.

Increasing resources and PHP-FPM child processes

Increasing the available RAM and CPU did not help. The disk is not full, and the inodes are not exhausted.
With the increased resources, I raised pm.max_children to 100. It was originally 40, and that had been fine for years of operation. After I saw the log, I tried raising it to 75, then to 100.
Another website with several times more visitors has less hardware and works just fine. This website is not serving any difficult content, mostly just blogs.

For the sake of completeness, the FPM configuration looks like this:

pm.max_children = 100
pm.start_servers = 24
pm.min_spare_servers = 4
pm.max_spare_servers = 64
pm.max_requests = 500
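As a sanity check, the pool values can be tested against the ordering PHP-FPM expects for a dynamic pool. This sketch assumes `pm = dynamic` (not shown above, but suggested by the spare-server warnings), and the helper `check_dynamic_pool` is hypothetical.

```python
def check_dynamic_pool(max_children, start_servers, min_spare, max_spare):
    """Return a list of problems with a pm = dynamic pool configuration."""
    problems = []
    # Spare-server limits cannot exceed the hard cap on children.
    if min_spare > max_children or max_spare > max_children:
        problems.append("spare server limits exceed pm.max_children")
    if max_spare < min_spare:
        problems.append("pm.max_spare_servers is below pm.min_spare_servers")
    # start_servers has to sit between the two spare-server limits.
    if not (min_spare <= start_servers <= max_spare):
        problems.append("pm.start_servers is outside the spare server range")
    return problems

# Values from the configuration above.
print(check_dynamic_pool(max_children=100, start_servers=24,
                         min_spare=4, max_spare=64))  # prints []
```

The empty list means the values are internally consistent, so the warnings point at load or slow requests, not at a malformed pool definition.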

The logs do not mention any out-of-memory (OOM) kills either.

Investigating opcache

I read that opcache running out of memory could be the culprit. Alas, it has memory to spare:

Cache hits  89757614
Cache misses    1174
Used memory 58333696
Free memory 75884032
Wasted memory   0
OOM restarts    0
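The figures above can be turned into a hit rate and a memory-usage percentage. This short Python sketch uses only the numbers quoted above and assumes that used, free, and wasted memory add up to the configured opcache size.

```python
# Values copied from the opcache status output above.
hits = 89757614
misses = 1174
used_bytes = 58333696
free_bytes = 75884032
wasted_bytes = 0

hit_rate = hits / (hits + misses)
total_bytes = used_bytes + free_bytes + wasted_bytes
used_fraction = used_bytes / total_bytes

print(f"hit rate: {hit_rate:.4%}")                 # hit rate: 99.9987%
print(f"total:    {total_bytes / 2**20:.0f} MiB")  # total:    128 MiB
print(f"used:     {used_fraction:.1%}")            # used:     43.5%
```

With well under half of the 128 MiB in use, nothing wasted, and no OOM restarts, opcache memory exhaustion looks unlikely to be the cause.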

Nginx timeouts

Nginx parameters should not be the issue, as the buffer and timeout values seem to be pretty generous (I assume the unit of 3000 is seconds):
```
client_header_timeout 3000;
client_body_timeout 3000;
fastcgi_read_timeout 3000;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
```

Other info

PHP-FPM is not crashing; there is nothing in its logs besides the warnings about the children
xdebug is disabled
syslog and dmesg do not contain any relevant messages
php7.0, nginx 1.12.2

Is there anything else I can try?

Links to stuff that did not work

Were you able to do any PHP-side investigation into why it keeps logging `server reached pm.max_children setting (100), consider raising it`? Maybe child processes are taking longer than usual or are blocked, so new children have to be spawned to handle new requests. — ben5556, Oct 15 '18 at 20:56
I did not look into that yet. Do you have any suggestions where to start here? — Martin Melka, Oct 16 '18 at 08:23