
This has been asked many times, but none of the answers helped. After hours of digging, I am turning here for help. I am a developer with limited sysadmin experience, but because our ops person left, I am left to try and keep things alive.

On one of our sites we recently started getting random 502 errors. This happens regularly, at least a dozen times a day (as reported by Nagios and sometimes by our users). I am not aware of any configuration changes. The web stack is standard: an nginx server proxying requests to PHP-FPM, which runs a WordPress-based app.


The nginx error log contains a lot of messages like this:

[error] 31180#31180: *451395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: x.x.x.x, server: x.x, request: "GET /x/x/ HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "x.x.x"

Most of them come from a client IP that is the IP of the server itself (not sure why, maybe some monitoring?), but there are errors from random public IPs as well.
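To see how the errors split between the server's own address and outside clients, the log lines can be tallied per client. This is a minimal sketch, assuming the error log format shown above. The sample lines and everything in them (127.0.0.1, 203.0.113.7, example.test) are made up for illustration, and `count_reset_clients` is just a name chosen here.

```python
import re
from collections import Counter

# Matches the "client: <address>," field of an nginx error log line.
CLIENT_RE = re.compile(r"client: ([^,]+),")


def count_reset_clients(lines):
    """Count 'Connection reset by peer' upstream errors per client address."""
    counts = Counter()
    for line in lines:
        # Skip unrelated errors so only the resets behind the 502s are counted.
        if "104: Connection reset by peer" not in line:
            continue
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines in the same shape as the real log entries.
sample_lines = [
    '[error] 31180#31180: *451395 recv() failed (104: Connection reset by peer) '
    'while reading response header from upstream, client: 127.0.0.1, '
    'server: example.test, request: "GET /a/ HTTP/1.0", '
    'upstream: "fastcgi://127.0.0.1:9000", host: "example.test"',
    '[error] 31180#31180: *451396 recv() failed (104: Connection reset by peer) '
    'while reading response header from upstream, client: 203.0.113.7, '
    'server: example.test, request: "GET /b/ HTTP/1.1", '
    'upstream: "fastcgi://127.0.0.1:9000", host: "example.test"',
    '[error] 31180#31180: *451397 recv() failed (104: Connection reset by peer) '
    'while reading response header from upstream, client: 127.0.0.1, '
    'server: example.test, request: "GET /c/ HTTP/1.0", '
    'upstream: "fastcgi://127.0.0.1:9000", host: "example.test"',
    '[error] 31180#31180: *451399 upstream timed out (110: Connection timed out) '
    'while reading response header from upstream, client: 203.0.113.7, '
    'server: example.test, request: "GET /d/ HTTP/1.1", '
    'upstream: "fastcgi://127.0.0.1:9000", host: "example.test"',
]

for client, hits in count_reset_clients(sample_lines).most_common():
    print(f"{hits:4d}  {client}")
```

On a real server the same function can be fed an open file handle for the nginx error log instead of `sample_lines`.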

The PHP-FPM log gives warnings like this approximately every hour:

WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 71 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 75 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 79 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 83 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 87 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 91 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 95 total children
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 99 total children
WARNING: [pool www] server reached pm.max_children setting (100), consider raising it
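To get a quick picture of how often the pool saturates, the warnings can be summarized with a short script. This is only a sketch, assuming the message wording shown above; `summarize_fpm_log` and the returned keys are names invented here, and the sample reuses three of the lines from my log.

```python
import re

# "seems busy" warning: captures the spawn count, idle count, and total children.
BUSY_RE = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] seems busy .*?"
    r"spawning (?P<spawning>\d+) children, "
    r"there are (?P<idle>\d+) idle, and (?P<total>\d+) total children"
)

# Hard limit warning: captures the configured pm.max_children value.
LIMIT_RE = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] server reached pm\.max_children "
    r"setting \((?P<limit>\d+)\)"
)


def summarize_fpm_log(lines):
    """Count busy warnings and limit hits, and track the peak child count."""
    summary = {"busy_warnings": 0, "peak_total_children": 0, "limit_hits": 0}
    for line in lines:
        busy = BUSY_RE.search(line)
        if busy:
            summary["busy_warnings"] += 1
            summary["peak_total_children"] = max(
                summary["peak_total_children"], int(busy.group("total"))
            )
        elif LIMIT_RE.search(line):
            summary["limit_hits"] += 1
    return summary


sample_log = [
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, "
    "and 71 total children",
    "WARNING: [pool www] seems busy (you may need to increase pm.start_servers, "
    "or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, "
    "and 99 total children",
    "WARNING: [pool www] server reached pm.max_children setting (100), "
    "consider raising it",
]

print(summarize_fpm_log(sample_log))
```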

Things I have tried

Rebooting

Obvious, but did not help at all.

Increasing resources and PHP-FPM child processes

  • Increasing the available RAM and CPU did not help. The disk is not full, and the inodes are not exhausted.
  • With the increased resources, I raised pm.max_children to 100. It was originally 40, which was fine for years of operation. After I saw the log, I tried turning it up to 75, then 100.
  • Another website with several times more visitors has less hardware and works just fine. This website is not serving any difficult content, mostly just blogs.
  • For the sake of completeness, the FPM configuration looks like this:

    pm.max_children = 100
    pm.start_servers = 24
    pm.min_spare_servers = 4
    pm.max_spare_servers = 64
    pm.max_requests = 500
    
  • The logs do not mention any out-of-memory (OOM) events either.
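For reference, here is the arithmetic I would use to judge whether 100 children can fit in memory, plus a check that the spare-server settings are ordered the way FPM's dynamic process manager expects, as far as I understand it. This is a rough sketch: the 60 MB per child, the 8192 MB of RAM, and the 2048 MB reserved for everything else are hypothetical numbers, not measurements from this server, and the function names are made up.

```python
def worst_case_memory_mb(max_children, per_child_mb):
    """Memory needed if every allowed child is running at once."""
    return max_children * per_child_mb


def max_children_for(total_ram_mb, reserved_mb, per_child_mb):
    """Largest child count that fits after reserving RAM for other services."""
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0 or per_child_mb <= 0:
        return 0
    return usable_mb // per_child_mb


def spare_settings_consistent(pm):
    """True if min_spare <= start_servers <= max_spare <= max_children."""
    return (
        pm["pm.min_spare_servers"]
        <= pm["pm.start_servers"]
        <= pm["pm.max_spare_servers"]
        <= pm["pm.max_children"]
    )


# Values copied from the pool configuration above.
pool = {
    "pm.max_children": 100,
    "pm.start_servers": 24,
    "pm.min_spare_servers": 4,
    "pm.max_spare_servers": 64,
}

# Hypothetical figures; measure the real per-child memory before relying on this.
PER_CHILD_MB = 60
TOTAL_RAM_MB = 8192
RESERVED_MB = 2048

print(worst_case_memory_mb(pool["pm.max_children"], PER_CHILD_MB))
print(max_children_for(TOTAL_RAM_MB, RESERVED_MB, PER_CHILD_MB))
print(spare_settings_consistent(pool))
```

With these assumed numbers the limit of 100 would just about fit, so the memory side alone would not explain the saturation.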

Investigating opcache

  • I read that opcache running out of memory could be the culprit. Alas, it has memory to spare:

    Cache hits  89757614
    Cache misses    1174
    Used memory 58333696
    Free memory 75884032
    Wasted memory   0
    OOM restarts    0
    

Nginx timeouts

  • Nginx parameters should not be the issue, as the buffer and timeout values seem to be pretty generous (I assume the unit of 3000 is seconds):

    client_header_timeout 3000;
    client_body_timeout 3000;
    fastcgi_read_timeout 3000;
    fastcgi_buffers 16 16k;
    fastcgi_buffer_size 32k;
    

Other info

  • PHP-FPM is not crashing; there is nothing in its logs besides the warnings about the children
  • xdebug is disabled
  • syslog and dmesg do not contain any relevant messages
  • PHP 7.0, nginx 1.12.2

Is there anything else I can try?


Links to stuff that did not work

Martin Melka
  • Were you able to do any PHP-side investigation into why it keeps throwing `server reached pm.max_children setting (100), consider raising it`? Maybe some child processes are taking longer than usual, or threads are being blocked, causing new children to be spawned to handle new requests. – ben5556 Oct 15 '18 at 20:56
  • I did not look into that yet. Do you have any suggestions where to start here? – Martin Melka Oct 16 '18 at 08:23
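One possible starting point for that PHP-side investigation, sketched under assumptions: if the pool's status page is enabled (the `pm.status_path` setting), its plain-text output can be parsed to see whether all children are busy and whether requests are queueing. The sample text below is made up, the field names are the ones I believe the status page prints, and `parse_fpm_status` and `is_saturated` are hypothetical helper names.

```python
def parse_fpm_status(text):
    """Parse 'key: value' lines of the FPM status page into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, since some values contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def is_saturated(status):
    """True if no child is idle or requests are waiting in the listen queue."""
    return status.get("idle processes", 0) == 0 or status.get("listen queue", 0) > 0


# Made-up example of the status page output, not data from this server.
sample_status = """\
pool:                 www
process manager:      dynamic
start time:           15/Oct/2018:20:00:00 +0000
start since:          3600
accepted conn:        12345
listen queue:         3
max listen queue:     12
listen queue len:     128
idle processes:       0
active processes:     100
total processes:      100
max active processes: 100
max children reached: 9
slow requests:        0
"""

status = parse_fpm_status(sample_status)
print(status["active processes"], status["listen queue"], is_saturated(status))
```

Polling this once a minute and logging the result next to the nginx errors should show whether the 502s line up with moments when the pool is saturated.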
