34

I have some management commands that are based on gevent. Since my management command makes thousands of requests, I can turn all the socket calls into non-blocking calls using gevent. This really speeds up my application, as I can make requests simultaneously.
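Roughly, the pattern is something like this (a simplified sketch, not my actual command; the URLs and counts are placeholders):

```python
# Simplified sketch of the gevent approach described above (not the real command).
# monkey.patch_all() makes the standard socket module cooperative, so the HTTP
# requests below run concurrently on greenlets instead of blocking one another.
from gevent import monkey
monkey.patch_all()

import gevent
from urllib.request import urlopen

def fetch(url):
    # Placeholder worker; the real command does more than just read the body.
    return urlopen(url).read()

# Placeholder URLs; in reality there are thousands of requests.
jobs = [gevent.spawn(fetch, "http://example.com/item/%d" % i) for i in range(1000)]
gevent.joinall(jobs)
```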

Currently the bottleneck in my application seems to be Postgres. This appears to be because the Psycopg library that Django uses to connect to Postgres is written in C and does not support asynchronous connections.

I've also read that using pgBouncer can speed up Postgres by 2X. That sounds great, but could someone explain how pgBouncer works and how it actually helps?

Thanks

Mridang Agarwalla
  • There is also a chance that your database model does not match the queries you are firing at it. Normally, the network overhead is very small compared to the work needed to fetch blocks of data from disk; also, this does not cost throughput, only latency (except maybe in the case of very frequent connects/disconnects). – wildplasser May 02 '12 at 18:48

2 Answers

115

Besides saving the overhead of connect & disconnect where this is otherwise done on each request, a connection pooler can funnel a large number of client connections down to a small number of actual database connections. In PostgreSQL, the optimal number of active database connections is usually somewhere around ((2 * core_count) + effective_spindle_count). Above this number, both throughput and latency get worse. NOTE: Recent versions have improved concurrency, so in 2022 I would recommend something more like ((4 * core_count) + effective_spindle_count).
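For concreteness, here is a minimal sketch of that sizing rule (core_count and effective_spindle_count are values you would measure on your own hardware; the function name is just illustrative):

```python
def optimal_pool_size(core_count, effective_spindle_count, core_multiplier=2):
    """Rule of thumb from above: ((2 * core_count) + effective_spindle_count).

    Use core_multiplier=4 for recent PostgreSQL versions, per the note above.
    """
    return core_multiplier * core_count + effective_spindle_count

# Example: four quad-core processors (16 cores) with a fully cached data set
# (effectively no spindles to wait on) lands in the low-to-mid 30s, which is
# where the "about 35 database connections" figure below comes from.
print(optimal_pool_size(core_count=16, effective_spindle_count=0))  # 32
```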

Sometimes people will say "I want to support 2000 users, with fast response time." It is pretty much guaranteed that if you try to do that with 2000 actual database connections, performance will be horrible. If you have a machine with four quad-core processors and the active data set is fully cached, you will see much better performance for those 2000 users by funneling the requests through about 35 database connections.

To understand why that is true, this thought experiment should help. Consider a hypothetical database server machine with only one resource to share -- a single core. This core will time-slice equally among all concurrent requests with no overhead. Let's say 100 requests all come in at the same moment, each of which needs one second of CPU time. The core works on all of them, time-slicing among them until they all finish 100 seconds later. Now consider what happens if you put a connection pool in front which will accept 100 client connections but make only one request at a time to the database server, putting any requests which arrive while the connection is busy into a queue. Now when 100 requests arrive at the same time, one client gets a response in 1 second; another gets a response in 2 seconds, and the last client gets a response in 100 seconds. Nobody had to wait longer to get a response, throughput is the same, but the average latency is 50.5 seconds rather than 100 seconds.
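To make the arithmetic explicit, here is a tiny numeric check of that thought experiment (illustrative only, not PostgreSQL code):

```python
# 100 requests, each needing 1 second of CPU, on a single fairly time-sliced core.
n_requests = 100
cpu_seconds_each = 1.0

# Without a pool: the core time-slices equally, so every request finishes
# together at n_requests * cpu_seconds_each = 100 seconds.
unpooled = [n_requests * cpu_seconds_each] * n_requests

# With a pool of size 1: requests are queued and run one at a time, so the
# k-th request finishes after k seconds.
pooled = [k * cpu_seconds_each for k in range(1, n_requests + 1)]

print(max(unpooled), sum(unpooled) / n_requests)  # 100.0 100.0
print(max(pooled), sum(pooled) / n_requests)      # 100.0 50.5
```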

A real database server has more resources which can be used in parallel, but the same principle holds: once they are saturated, you only hurt things by adding more concurrent database requests. It is actually worse than the example, because with more tasks you have more task switches, increased contention for locks and cache, L2 and L3 cache line contention, and many other issues which cut into both throughput and latency. On top of that, while a high work_mem setting can help a query in a number of ways, that setting is the limit per plan node for each connection, so with a large number of connections you need to keep it very small to avoid flushing cache or even causing swapping, which leads to slower plans or to such things as hash tables spilling to disk.
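To put rough numbers on that last point (every figure here is made up, purely to show the scale):

```python
# Worst-case memory that could be claimed by sorts and hashes alone.
# All numbers below are assumptions chosen only for illustration.
connections = 2000
sort_or_hash_nodes_per_query = 3   # plan nodes that can each use work_mem
work_mem_mb = 64                   # a "generous" per-node setting

worst_case_gb = connections * sort_or_hash_nodes_per_query * work_mem_mb / 1024.0
print(worst_case_gb, "GB in the worst case")  # 375.0 GB
```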

Some database products effectively build a connection pool into the server, but the PostgreSQL community has taken the position that since the best connection pooling is done closer to the client software, they will leave it to the users to manage this. Most poolers will have some way to limit the database connections to a hard number, while allowing more concurrent client requests than that, queuing them as necessary. This is what you want, and it should be done on a transactional basis, not per statement or connection.
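For illustration, here is a rough sketch of what transaction-scoped pooling looks like when done on the client side with psycopg2's built-in pool (the DSN and pool sizes are placeholders; a dedicated pooler such as PgBouncer in transaction mode gives the same effect outside the application process):

```python
# Rough sketch only: borrow a real connection for exactly one transaction,
# then hand it back, so many client tasks share a small number of connections.
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

# Placeholder DSN and sizes; 35 matches the example figure above.
pool = ThreadedConnectionPool(minconn=1, maxconn=35, dsn="dbname=mydb user=myuser")

@contextmanager
def transaction():
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)

# Usage: any number of concurrent tasks, at most 35 real database connections.
with transaction() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
```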

kgrittn
  • 2
    Excellent answer. I could not agree more. – wildplasser May 02 '12 at 19:41
  • These front-end hippies all want to make and break connections as fast as possible, and put connection poolers in front if they cannot reach their natural-high state. I like the 2*ncore + nspindle formula. Every process is considered to be blocked in a disk read. – wildplasser May 02 '12 at 22:27
  • @kgrittn I assume in your thought experiment above, each query takes one second to run in the absence of other requests? – Michael Mior May 02 '12 at 22:36
  • @MichaelMior: Yeah, I definitely meant to put that in there, but missed it. Thanks. Edited to include that assumption. – kgrittn May 02 '12 at 23:01
  • @kgrittn I don't know anything about Postgres. I guess this means there's one OS thread per client connection and it relies on the OS for thread scheduling? – Michael Mior May 03 '12 at 13:30
  • 2
    @MichaelMior: In PostgreSQL there is one OS *process* per client connection, and it relies on the OS for scheduling. The various processes communicate through a shared memory segment, OS signals (where available), and a self-referencing UDP socket. – kgrittn May 03 '12 at 14:20
  • Cool. Thanks for the education :) I've really only ever worked with MySQL. – Michael Mior May 03 '12 at 14:37
12

PgBouncer reduces the latency in establishing connections by serving as a proxy which maintains a connection pool. This may help speed up your application if you're opening many short-lived connections to Postgres. If you only have a small number of connections, you won't see much of a win.
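If you do try it, the only change on the Django side is usually the connection settings, roughly like this (a sketch with placeholder names and credentials; 6432 is PgBouncer's conventional listen port, while Postgres itself stays on 5432):

```python
# settings.py sketch: point Django at a local PgBouncer instance instead of
# connecting to Postgres directly. All names and credentials are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "mydb",       # must match a database defined in pgbouncer.ini
        "USER": "myuser",
        "PASSWORD": "secret",
        "HOST": "127.0.0.1",  # PgBouncer running locally as a proxy
        "PORT": "6432",       # PgBouncer's listen port
    }
}
```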

Michael Mior
  • If I've understood this correctly - Django still creates connections again and again, but pgBouncer reduces the time needed to create each connection. I've heard that Django creates a new connection for every request. By request, do people mean a web request to fetch a page (which would mean that every single command executed in the cycle of a view goes through one single database connection), or does request mean every individual database hit (SELECT, INSERT, UPDATE and DELETE), in which case every single command would be executed within a new connection even though they would be in the same view cycle? – Mridang Agarwalla May 02 '12 at 18:46
  • 2
    Yes, Django will create a new connection, but the connection will be established faster since it will be to a local PgBouncer instance. Django will use a new connection for every web request, not database query. – Michael Mior May 02 '12 at 18:58
  • 1
    You might find [this question](http://stackoverflow.com/questions/1125504/django-persistent-database-connection) has some more interesting info. But note that there is a reason new connections are opened on every request. If a request encounters an error, it's possible a transaction might not be properly closed (among other things) leading to unexpected results. – Michael Mior May 02 '12 at 18:59
  • Is there a way I could check how many times Django creates and destroys a connection during the time my management command runs? I could check this, and if Django is making a lot of fresh connections, it would be good to use pgBouncer; otherwise I'll look into another scenario. It's a management command and not a web page request, so I'm wondering if the database connection is really created only once. I have thousands of db hits from my management command. Thanks. – Mridang Agarwalla May 02 '12 at 19:06
  • 2
    You can connect to the signal `django.db.backends.signals.connection_created` and then do some logging. (Note that you wouldn't want to do this in production since it will add unnecessary overhead.) – Michael Mior May 02 '12 at 22:29
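A rough sketch of that logging idea (where you register it, for example at the start of the management command, is up to you; the counter is purely for illustration):

```python
# Debugging aid only: log every new database connection Django opens.
import logging
from django.db.backends.signals import connection_created

logger = logging.getLogger(__name__)
connection_count = 0

def log_new_connection(sender, connection, **kwargs):
    global connection_count
    connection_count += 1
    logger.info("New database connection #%d (alias=%s)",
                connection_count, connection.alias)

connection_created.connect(log_new_connection)
```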