
I work for a rather busy internet site that often gets very large spikes of traffic. During these spikes, hundreds of pages per second are requested, and this produces random 502 gateway errors.

We run Nginx (1.0.10) and PHP-FPM on a machine with 4x 15k SAS drives (RAID 10), a 16-core CPU and 24GB of DDR3 RAM. We also use the latest Xcache version. The DB is located on another machine, but its load is very low and it has no issues.

Under normal load everything runs perfectly: system load is below 1, and the PHP-FPM status report never really shows more than 10 active processes at a time. There is always about 10GB of RAM still available. Under normal load the machine handles about 100 pageviews per second.

The problem arises when huge spikes of traffic arrive and hundreds of pageviews per second are requested from the machine. I notice that FPM's status report then shows up to 50 active processes, which is still well below the 300 max children we have configured. During these spikes the Nginx status report shows up to 5,000 active connections instead of the normal average of 1,000.
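For reference, those figures come from the nginx stub_status page and PHP-FPM's status page (pm.status_path), queried locally with something like this (the exact paths here are just examples, not our real ones):

curl http://127.0.0.1/nginx_status
curl http://127.0.0.1/status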

OS Info: CentOS release 5.7 (Final)

CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (16 cores)

php-fpm.conf

daemonize = yes
listen = /tmp/fpm.sock
pm = static
pm.max_children = 300
pm.max_requests = 1000

I have not set rlimit_files, because as far as I know it falls back to the system default if you don't set it.
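(If I did want to set it explicitly, as far as I can tell it would just be a matter of adding the pool directive below; the value is only an example matching our system limit.)

rlimit_files = 65536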

fastcgi_params (only the values added to the standard file)

fastcgi_connect_timeout 60;
fastcgi_send_timeout 180;
fastcgi_read_timeout 180;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 256k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
fastcgi_intercept_errors on;

fastcgi_pass            unix:/tmp/fpm.sock;

nginx.conf

worker_processes        8;
worker_connections      16384;
sendfile                on;
tcp_nopush              on;
keepalive_timeout       4;

Nginx connects to FPM via a Unix socket.
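For completeness, the PHP handling in nginx follows the standard pattern, roughly like this (a simplified sketch, not the exact production config):

location ~ \.php$ {
    include        fastcgi_params;
    fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
    fastcgi_pass   unix:/tmp/fpm.sock;
}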

sysctl.conf

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 1
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.eth0.rp_filter=1
net.ipv4.conf.lo.rp_filter=1
net.ipv4.ip_conntrack_max = 100000

limits.conf

* soft nofile 65536
* hard nofile 65536

These are the results for the following commands:

ulimit -n
65536

ulimit -Sn
65536

ulimit -Hn
65536

cat /proc/sys/fs/file-max
2390143

Question: If PHP-FPM is not running out of connections, the load is still low, and there is plenty of RAM available, what bottleneck could be causing these random 502 gateway errors during high traffic?

Note: by default this machine's ulimits were 1024; since I changed them to 65536 I have not fully rebooted the machine, as it's a production machine and a reboot would mean too much downtime.
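As far as I understand, the daemons only pick up the new nofile limit when they are (re)started, so restarting them from a shell that already has the raised limit should be enough without a full reboot (service names below are an assumption); on kernels 2.6.24 and later the effective limit of a running process can also be checked via /proc:

service php-fpm restart
service nginx restart

grep 'open files' /proc/$(pgrep -o php-fpm)/limits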

Mr.Boon
  • Maybe something else is running alongside? http://www.osbmedia.com/blog/view/php-fpm-nginx-502-gateway – c69 Jan 07 '12 at 19:23
  • nginx itself can be the bottleneck. Track IO as well, not only load. And I think that's more something for Server Fault; I'm a coder, not a sysadmin. – hakre Jan 07 '12 at 19:40
  • I have xcache running also. 256MB dedicated to Xcache, but that's never all used. – Mr.Boon Jan 07 '12 at 20:44
  • You are probably serving dynamic content with nginx; that is a known issue. Nginx handles static content very well, but for dynamic content it is better to use another server behind nginx acting as a transparent proxy, e.g. Apache (dynamic content) + nginx (transparent proxy for static content). In that case your server should be able to handle unbelievable loads. I know it is a massive change for a system of this scale, but you could try such a config on another server and then compare ab tests to see the difference. – Valentin Rusk Feb 21 '12 at 11:20
  • It would be interesting to get samples of the nginx access.log and php-fpm's slow log. Your server looks very capable, but that's an assumption too, since I don't know what kind of application your server runs. The framework of choice alone can cripple performance. – Till Feb 26 '12 at 02:09
  • A 502 error means there's something going on between nginx and php5-fpm. Something is causing the backend to refuse the connection. That usually happens because something is pegged. That's a fairly large number of connections, but not extreme by any means. Could you check to see if your pipe is saturated? Could you add some evidence that the issue is not a saturated pipe as well as some resource usage during your peak times? Also, check your nginx error logs and php logs. If the requests at least make it to PHP, you should see a log of this. - Add info and I'll see if I can help any further. – MTeck Mar 07 '12 at 21:23
  • Check the error log to see if there's anything there... Also, why don't you use lighttpd + php-fastcgi on port 8080 and send the traffic there using nginx? – StiveKnx Mar 08 '12 at 19:20

4 Answers


This should fix it...

You have: fastcgi_buffers 4 256k;

Change it to: fastcgi_buffers 256 16k; # 4096k total

Also set fastcgi_max_temp_file_size 0; that will disable buffering to disk if replies start to exceed your fastcgi buffers.
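Put together, the relevant part of the fastcgi configuration would look roughly like this (a sketch based on the values above combined with the question's existing settings):

fastcgi_buffer_size           128k;
fastcgi_buffers               256 16k;
fastcgi_busy_buffers_size     256k;
fastcgi_max_temp_file_size    0;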

Timothy Perez

A Unix socket accepts only 128 pending connections by default (the listen backlog). It is a good idea to put this line into /etc/sysctl.conf:

net.core.somaxconn = 4096
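To pick it up without a reboot, reload the file or set the value directly:

sysctl -p
sysctl -w net.core.somaxconn=4096

PHP-FPM also has a listen.backlog pool directive; the kernel caps the effective backlog at net.core.somaxconn, so the two may need raising together (worth verifying against the FPM version in use).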
kait

If that doesn't help, in some cases it's worth using a normal TCP port bind instead of the socket, because the socket at 300+ concurrent connections can block new requests, forcing nginx to return 502.
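A rough sketch of the switch on both sides (127.0.0.1:9000 is just the conventional address, not taken from the question):

In the php-fpm pool config:

listen = 127.0.0.1:9000

In nginx:

fastcgi_pass 127.0.0.1:9000;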

Misiek

@Mr. Boon

I have 8 cores and 14 GB of RAM, but the system gives gateway time-outs very often.
Implementing the fix below also didn't solve the issue. Still searching for better fixes.

You have: fastcgi_buffers 4 256k;

Change it to:

fastcgi_buffers 256 16k; # 4096k total

Also set fastcgi_max_temp_file_size 0, that will disable buffering to disk if replies start to exceed your fastcgi buffers.

Thanks.

Priya