This has been asked many times, but none of the answers helped. After hours of digging, I am turning here for help. I am a developer with limited sysadmin experience, but because our ops person left, I am left to try and keep things alive.
On one of our sites we recently started getting random 502 errors. This happens fairly regularly, at least a dozen times a day (as reported by Nagios and sometimes by our users). I am not aware of any configuration changes. The web stack is standard: an nginx server proxying requests to PHP-FPM, which runs a WordPress-based app.
The nginx error log contains a lot of messages like this:
[error] 31180#31180: *451395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: x.x.x.x, server: x.x, request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"
Most of them come from a client IP that is the IP of the server itself (not sure why, maybe some monitoring?), but there are errors from random public IPs as well.
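To put numbers on that breakdown, the error log can be tallied by client address. This is only a minimal sketch: the helper name `count_reset_clients` and the sample addresses are made up for illustration, and it assumes every relevant line carries a `client: <address>,` field like the one quoted above.

```python
import re
from collections import Counter

# Matches the "client: <address>," field seen in the nginx error-log line above.
CLIENT_RE = re.compile(r"client: ([^,\s]+),")


def count_reset_clients(lines):
    """Count 'Connection reset by peer' upstream errors per client address."""
    counts = Counter()
    for line in lines:
        if "Connection reset by peer" not in line:
            continue
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative lines only; the addresses are placeholders, not real log data.
sample = [
    '[error] 1#1: *1 recv() failed (104: Connection reset by peer) while reading '
    'response header from upstream, client: 127.0.0.1, server: x.x, '
    'request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"',
    '[error] 1#1: *2 recv() failed (104: Connection reset by peer) while reading '
    'response header from upstream, client: 203.0.113.7, server: x.x, '
    'request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"',
    '[error] 1#1: *3 recv() failed (104: Connection reset by peer) while reading '
    'response header from upstream, client: 127.0.0.1, server: x.x, '
    'request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"',
]

for client, hits in count_reset_clients(sample).most_common():
    print(hits, client)
```

In practice the function could be fed an open file handle for the real error log instead of the sample list.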
The PHP-FPM log gives warnings like this approximately every hour:
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 71 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 75 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 79 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 83 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 87 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 91 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 95 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 99 total children
WARNING: [pool www] server reached pm.max_children setting (100), consider raising it
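To see how often the pool actually hits its ceiling, the warnings can be summarized with a short script. This is a rough sketch that assumes the log lines have exactly the wording shown above; the function name `summarize_fpm_warnings` is hypothetical.

```python
import re

# Patterns follow the wording of the two PHP-FPM warnings quoted above.
BUSY_RE = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] seems busy .*?"
    r"spawning (?P<spawning>\d+) children, "
    r"there are (?P<idle>\d+) idle, and (?P<total>\d+) total children"
)
MAX_RE = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] server reached pm\.max_children "
    r"setting \((?P<limit>\d+)\)"
)


def summarize_fpm_warnings(lines):
    """Return the peak child count seen and how often max_children was reached."""
    peak_total = 0
    limit_hits = 0
    limit = None
    for line in lines:
        busy = BUSY_RE.search(line)
        if busy:
            peak_total = max(peak_total, int(busy.group("total")))
            continue
        hit = MAX_RE.search(line)
        if hit:
            limit_hits += 1
            limit = int(hit.group("limit"))
    return {"peak_total": peak_total, "limit_hits": limit_hits, "limit": limit}


# Three of the warnings from the log excerpt above.
sample = [
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, "
    "and 71 total children",
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, "
    "and 99 total children",
    "WARNING: [pool www] server reached pm.max_children setting (100), "
    "consider raising it",
]

print(summarize_fpm_warnings(sample))
```

Run against the full log, grouped per hour, this would show whether each burst ends at the limit or subsides before reaching it.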
Things I have tried
Rebooting
Obvious, but did not help at all.
Increasing resources and PHP-FPM child processes
- Increasing the available RAM and CPU did not help. The disk is not full, and inodes are not exhausted.
- With the increased resources, I set pm.max_children up to 100. It was originally 40, and that was okay for years of operation. After I saw the log, I tried turning it up to 75, then 100.
- Another website with several times more visitors has less hardware and works just fine. This website is not serving any difficult content, mostly just blogs.
For the sake of completeness, the FPM configuration looks like this:
pm.max_children = 100
pm.start_servers = 24
pm.min_spare_servers = 4
pm.max_spare_servers = 64
pm.max_requests = 500
The logs do not mention any out-of-memory (OOM) events either.
Investigating opcache
I read that opcache running out of memory could be the culprit. Alas, it has memory to spare:
Cache hits 89757614
Cache misses 1174
Used memory 58333696
Free memory 75884032
Wasted memory 0
OOM restarts 0
Nginx timeouts
Nginx parameters should not be the issue, as the buffer and timeout values seem to be pretty generous (I assume the unit of 3000 is seconds):
client_header_timeout 3000;
client_body_timeout 3000;
fastcgi_read_timeout 3000;
fastcgi_buffers 16 16k;
fastcgi_buffer_size 32k;
Other info
- PHP-FPM is not crashing; there is nothing in its logs besides the warnings about the children
- xdebug is disabled
- syslog and dmesg do not contain any relevant messages
- PHP 7.0, nginx 1.12.2
Is there anything else I can try?