
We have a big EC2 instance with 32 cores, currently running Nginx, Tornado and Redis, serving on average 5K requests per second. Everything seems to work fine, but CPU load is already reaching 70% and we have to support even more requests. One thought was to replace Tornado with uWSGI, because we don't really use Tornado's async features.

Our application consists of a single function: it receives a JSON payload (~4 KB), does some blocking but very fast work (Redis) and returns JSON. The request path, step by step (a simplified code sketch follows the list):

  • Proxy HTTP request to one of the Tornado instances (Nginx)
  • Parse HTTP request (Tornado)
  • Read POST body string (stringified JSON) and convert it to python dictionary (Tornado)
  • Take data out of Redis (blocking), located on the same machine (py-redis with hiredis)
  • Process the data (python3.4)
  • Update Redis on the same machine (py-redis with hiredis)
  • Prepare stringified JSON for response (python3.4)
  • Send response to proxy (Tornado)
  • Send response to client (Nginx)
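
A stripped-down sketch of what the handler looks like (names and the Redis key layout are simplified, this is not the exact production code):

import json

import redis
import tornado.ioloop
import tornado.web

# Redis runs on the same machine; py-redis picks up hiredis automatically if installed
rds = redis.StrictRedis(host="localhost", port=6379)


def process(payload, current):
    # placeholder for the real (fast, CPU-only) business logic
    return {"ok": True, "had_previous": current is not None}


class Handler(tornado.web.RequestHandler):
    def post(self):
        payload = json.loads(self.request.body.decode("utf-8"))  # ~4 KB stringified JSON
        current = rds.get(payload["key"])             # blocking, but local and very fast
        result = process(payload, current)            # plain python 3.4 code
        rds.set(payload["key"], json.dumps(result))   # blocking update, local
        self.write(json.dumps(result))                # stringified JSON back to nginx


if __name__ == "__main__":
    tornado.web.Application([(r"/", Handler)]).listen(8001)
    tornado.ioloop.IOLoop.current().start()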

We thought the speed improvement would come from the uwsgi protocol: we could put Nginx on a separate server and proxy all requests to uWSGI over the uwsgi protocol (roughly the setup sketched below). But after trying every configuration we could think of and changing OS parameters, we still can't get it to handle even the current load. Most of the time the nginx log contains 499 and 502 errors. In some configurations it just stopped accepting new requests, as if it had hit some OS limit.
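
For reference, the kind of setup we were trying looks roughly like this (socket path and addresses are just illustrative):

upstream uwsgi_backend {
    # local unix socket, or an uWSGI box reachable over the network
    server unix:/tmp/uwsgi.sock;
    # server 10.0.0.2:3031;
}

server {
    listen 80;
    location / {
        include uwsgi_params;
        uwsgi_pass uwsgi_backend;
    }
}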

So, as I said, we have 32 cores, 60 GB of free memory and a very fast network. We don't do heavy work, only very fast blocking operations. What is the best strategy in this case: processes, threads, async? What OS parameters should be set?

Current configuration is:

[uwsgi]
master = 2
processes = 100
socket = /tmp/uwsgi.sock
wsgi-file = app.py
daemonize = /dev/null
pidfile = /tmp/uwsgi.pid
listen = 64000
stats = /tmp/stats.socket
cpu-affinity = 1
max-fd = 20000
memory-report = 1
gevent = 1000
thunder-lock = 1
threads = 100
post-buffering = 1

Nginx config:

user www-data;
worker_processes 10;
pid /run/nginx.pid;

events {
    worker_connections 1024;
    multi_accept on;
    use epoll;
}

OS config:

sysctl net.core.somaxconn
net.core.somaxconn = 64000

I know these limits are too high; I started trying every possible value.

UPDATE:

I ended up with the following configuration:

[uwsgi]
chdir = %d
master = 1
processes = %k
socket = /tmp/%c.sock
wsgi-file = app.py
lazy-apps = 1
touch-chain-reload = %dreload
virtualenv = %d.env
daemonize = /dev/null
pidfile = /tmp/%c.pid
listen = 40000
stats = /tmp/stats-%c.socket
cpu-affinity = 1
max-fd = 200000
memory-report = 1
post-buffering = 1
threads = 2
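
(For reference: %d, %k and %c are uWSGI magic variables, namely the directory containing the config file, the number of detected CPU cores, and the name of that directory, if I read the docs right.)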
offline15

1 Answer


I think your request handling roughly breaks down as follows:

  • HTTP parsing, request routing, JSON parsing
  • execute some python code which yields a redis request
  • (blocking) redis request
  • execute some python code which processes the redis response
  • JSON serialization, HTTP response serialization

You could benchmark the handling time on a near-idle system. My hunch is that the round trip would boil down to 2 or 3 milliseconds. At 70% CPU load this would go up to about 4 or 5 ms (not counting time spent in nginx request queue, just the handling in uWSGI worker).
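
Something as simple as the loop below gives a ballpark figure; handle_request() here is a stand-in for your actual parse -> redis -> process -> serialize path:

import json
import time


def handle_request(body):
    # replace this stub with the real parse -> redis -> process -> serialize path
    return json.dumps(json.loads(body))


body = json.dumps({"key": "test", "data": "x" * 4000})  # roughly a 4 KB payload

N = 1000
start = time.perf_counter()
for _ in range(N):
    handle_request(body)
elapsed = time.perf_counter() - start
print("average per request: %.2f ms" % (elapsed / N * 1000.0))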

At 5k req/s and 4-5 ms per request, your average number of in-flight requests would be in the 20...25 range (5000 req/s × 0.004-0.005 s, by Little's law). A decent match for your 32-core VM.

Next step is to balance the CPU cores. If you have 32 cores, it does not make sense to allocate 1000 worker processes; you might end up choking the system on context-switching overhead. A good balance keeps the total number of workers (nginx + uWSGI + redis) in the same order of magnitude as the number of available CPU cores, maybe with a little extra to cover for blocking I/O (i.e. filesystem, but mainly network requests made to other hosts such as a DBMS). If blocking I/O becomes a big part of the equation, consider rewriting into asynchronous code and integrating an async stack.

First observation: you're allocating 10 workers to nginx. However, the CPU time nginx spends on a request is MUCH lower than the time uWSGI spends on it. I would start by dedicating about 10% of the system to nginx (3 or 4 worker processes).

The remainder would have to be split between uWSGI and redis. I don't know about the size of your indices in redis, or about the complexity of your python code, but my first attempt would be a 75%/25% split between uWSGI and redis. That would put redis on about 6 workers and uWSGI on about 20 workers + a master.

As for the threads option in the uWSGI configuration: thread switching is lighter than process switching, but if a significant part of your python code is CPU-bound it won't help because of the GIL. The threads option is mainly interesting when a significant part of your handling time is blocked on I/O. You could disable threads, or try workers=10, threads=2 as an initial attempt (a concrete starting point is sketched below).
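
Concretely, a first attempt could look something like this (only the relevant knobs; socket path taken from your config, tune processes after watching htop):

[uwsgi]
master = true
# roughly the cores left over after nginx and redis
processes = 10
# 2 threads per worker only to cover the short blocking redis calls; the GIL limits anything more
threads = 2
socket = /tmp/uwsgi.sock
# keep this at or below net.core.somaxconn
listen = 4096
wsgi-file = app.py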

Freek Wiekmeijer
  • Thank you very much, I was able to get it running at up to 9K requests per second with 16 processes and 20 threads. The CPU is around 45% now. I can try to reduce the number of threads, though this server is in production and I can't play with it too much. I've also adjusted OS parameters according to this [article](http://www.nateware.com/linux-network-tuning-for-2013.html#.VSPuC9-jmkA). The problem with redis is that it works with only one core, but I can reduce the number of nginx workers. If I understand correctly, the right number of workers is reached when each of them is using close to 100% of a core. – offline15 Apr 07 '15 at 14:52
  • Glad to hear you made nice improvements (5000 req/s @ 70% CPU --> 9000 req/s @ 45%)! I think the uWSGI _threads_ setting is still way too high. Right now you have 16 workers * 20 threads each. 20 simultaneous threads per worker only maxes out the CPU when the python code is 95% I/O-bound. I'd enable a few more workers (like 24) and reduce the _threads_ setting to 2 or maybe 4. Run htop to see how the load is balanced over the CPUs. – Freek Wiekmeijer Apr 08 '15 at 06:48
  • Is there any way to know how much of the time a process is I/O-bound and how much time the CPU spends on calculations, for a live running process? – offline15 Apr 08 '15 at 10:28
  • The _top_ command gives an indication: `%Cpu(s): 7,8 us, 3,3 sy, 0,0 ni, 89,0 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st` (user/system/nice/idle/wait/hardware interrupt/software interrupt/steal). See http://linuxaria.com/howto/understanding-the-top-command-on-linux for an explanation. Press `1` in top to see stats for each individual CPU core. – Freek Wiekmeijer Apr 08 '15 at 11:40
  • @offline15 can you please update the thread with your final config? I need it too :) – woozly Dec 17 '15 at 13:14
  • Updated with the latest configuration I have on my server. Together with code changes, I was able to reach 30k req/s. – offline15 Dec 19 '15 at 06:40