Persistent DB Connection in Django/WSGI application

Question

I want to keep a persistent connection open to a third party legacy database in a django powered web application.

I want to keep the a connection between the web app and the legacy database open since creating a new connection is very slow for this special DB.

It is not like usual connection pooling since I need to store the connection per web user. User "Foo" needs its own connection between web server and legacy DB.

Up to now I use Apache and wsgi, but I could change if an other solution fits better.

Up to now I use django. Here I could change, too. But the pain would be bigger since there is already a lot of code which needs to be integrated again.

Up to now I use Python. I guess Node.js would fit better here, but the pain to change is too high.

Of course some kind of timeout would be needed. If there is not http-request from user "Foo" for N minutes, then the persistent connection would need to get shut down.

How could this be solved?

Update

I call it DB but it is not a DB which is configured via settings.DATABASES. It is a strange, legacy not wide spread DB-like system I need to integrate.

If I there are 50 people online using the web app in this moment, then I need to have 50 persistent connections. One for each user.

Code to connect to DB

I could execute this line in every request:

strangedb_connection = strangedb.connect(request.user.username)

But this operation is slow. Using the connection is fast.

Of course the strangedb_connection can't be serialized and can't be stored in a session :-)

You could develop a small python app between django and db to run 24/7, but this has got some soft spots. — Wtower, Dec 09 '15 at 07:41
then could you add the actual code? which open and close connections after each request? — Louis Barranqueiro, Dec 09 '15 at 11:59
@xyres `strangedb` holds too much data. I can't keep in in memory. — guettli, Dec 14 '15 at 14:45
I recommend using the [SQLite backend ORM code](https://github.com/django/django/blob/master/django/db/backends/sqlite3/base.py) as a pattern for writing your own for your custom db (we can't write it for you since we don't know what it is, data types, etc. - you haven't given enough information to do so). However, using this as a pattern, you can then instantiate a connection in the `__init__()` method that is shared by all methods. Then write custom methods that handle queries and writes (similar to `SQLiteCursorWrapper` but specific to your database type only sharing a single connection). — Bob Dylan, Dec 14 '15 at 22:57
@BobDylan `strangedb` is a database, but not a relational one. It needs an own connection per user. With these parameters, I guess your solution does not work. If yes, please elaborate. ... Your Woody Guthrie :-) — guettli, Dec 15 '15 at 09:05
Have you tried to store this connection in a dictionary ? Not in a local context but in the global context ? Drawbacks: you will have to check if it exists an open connection for this user, you will have manually to close the connection when not used and this connection will not be shared between Django instances (you will have further work to make in order to have it scale properly and avoid running too much connections at the same time)... — Gabriel Pichot, Dec 15 '15 at 10:44
@Gahbu I could store the connection in a dictionary. But I have N workers which handle the http-requests. The first request of user Foo might go to worker1 while the next might go to worker2. Each worker is an python process. I guess storing the connection in a dictionary does not help. Please correct me if I am wrong. — guettli, Dec 15 '15 at 11:37
@guettli, Yes, I totally agree, that's what I meant by "scaling" you can, in this case go with the solution provided by Wtower, see http://stackoverflow.com/questions/6927589/shared-object-between-requests-in-django — Gabriel Pichot, Dec 15 '15 at 15:47

dnozay · Accepted Answer · 2015-12-20T23:03:37.387

worker daemon managing the connection

Your picture currently looks like:

user  ----------->  webserver  <--------[1]-->  3rd party DB

connection [1] is expensive.

You could solve this with:

user ---->  webserver  <--->  task queue[1]  <--->  worker daemon  <--[2]-> 3rd party DB

[1] task queue can be redis, celery or rabbitmq.
[2] worker daemon keeps connection open.

A worker daemon would do the connection to the 3rd party database and keep the connection open. This would mean that each request would not have to pay the connection costs. The task queue would be the inter-process communication, dispatching work to the daemon and do the queries in the 3rd party db. The webserver should be as light as possible in terms of processing and let workers do expensive tasks.

preloading with apache + modwsgi

You can actually preload and have the expensive connection done before the first request. This is done with the WSGIImportScript configuration directive. I don't remember at the top of my head if having a pre-load + forking configuration means each request will already have the connection opened and share it; but since you have most of the code, this could be an easy experiment.

preloading with uwsgi

uwsgi supports preloading too. This is done with the import directive.

I don't know who downvoted your answer. It was not me. I would like to know the reason, too. At first sight your solution looks complicated. But I like it more than thread bases solutions. Thank you for your answer. — guettli, Dec 21 '15 at 08:48
I did, but changed to upvote. The original answer was very light on detail, since the edit is makes much more sense. — Doddie, Dec 21 '15 at 09:36

Doddie · Answer 2 · 2015-12-18T22:01:10.010

As far as I can tell, you have ruled out most (all?) of the common solutions to this type of problem:

Store connection in a dictionary ... need N workers and can't guarantee which request goes to which worker
Store data in cache ... too much data
Store connection info in cache ... connection is not serialisable

As far as I can see there is really only 1 'meta' solution to this, use @Gahbu's suggestion of a dictionary and guarantee the requests for a given user go to the same worker. I.e. figure out a way to map from the User object to a given worker the same way every time (maybe hash their name and MOD by the number of workers?).

This solution would not make the most of your N workers if the currently active Users all mapped to the same worker, but if all Users are equally likely to be active at the same time then the work should be equally spread. (If they are not all equally likely then the mapping may be able to account for that).

The two possible ways I can think of doing this would be:

1. Write a custom request allocator

I'm not really familiar with the apache/wsgi interfacing land but ... it might be possible to replace the component within your Apache server that dispatches the HTTP requests to the workers with some custom logic, such that it always dispatches to the same process.

2. Run a load-balancer/proxy in front of N single threaded workers

I'm not sure if you can use a ready to go package here or not, but the concept would be:

Run a proxy that implements this 'bind the User to an index' logic
Have the proxy then forward the requests to one of N copies of your Apache/wsgi webserver which each has a single worker.

NB: This second idea I came across here: https://github.com/benoitc/gunicorn/issues/183

Summary

For both options the implementation in your existing application is pretty simple. Your application just changes to use a dictionary for storing the persistent connection (creating one if there isn't one already). Testing a single instance is the same in dev as in production. In production, the instances themselves are none the wiser that they are always asked about the same users.

I like Option 2 here for the following reasons:

Maybe there is an existing server package that allows you to define this proxy trickiness
If not, creating a custom proxy application to sit in front of your current application might not be too hard (especially considering the restrictions you are (already) under when the request reaches the strangedb service)

Looks good. What do you mean with "bind the user to an index" in this context? — guettli, Dec 18 '15 at 21:25
In my head there is some "function" that takes a hash of the User (or their session key or something unique) and then MODs it by the number N (of single threaded workers) to produce which "index" of the worker to dispatch to. Make sense? — Doddie, Dec 18 '15 at 21:59
@Doddle thank for the clarification now I get what you mean with "index" — guettli, Dec 19 '15 at 11:54

Aya · Answer 3 · 2015-12-20T00:28:18.697

Rather than having multiple worker processes, you can use use the WSGIDaemonProcess directive to have multiple worker threads which all run in a single process. That way, all the threads can share the same DB connection mapping.

With something like this in your apache config...

# mydomain.com.conf

<VirtualHost *:80>

    ServerName mydomain.com
    ServerAdmin webmaster@mydomain.com

    <Directory />
        Require all granted
    </Directory>

    WSGIDaemonProcess myapp processes=1 threads=50 python-path=/path/to/django/root display-name=%{GROUP}
    WSGIProcessGroup myapp
    WSGIScriptAlias / /path/to/django/root/myapp/wsgi.py

</VirtualHost>

...you can then use something as simple as this in your Django app...

# views.py

import thread
from django.http import HttpResponse

# A global variable to hold the connection mappings
DB_CONNECTIONS = {}

# Fake up this "strangedb" module
class strangedb(object):

    class connection(object):
        def query(self, *args):
            return 'Query results for %r' % args

    @classmethod
    def connect(cls, *args):
        return cls.connection()


# View for homepage
def home(request, username='bob'):

    # Remember thread ID
    thread_info = 'Thread ID = %r' % thread.get_ident()

    # Connect only if we're not already connected
    if username in DB_CONNECTIONS:
        strangedb_connection = DB_CONNECTIONS[username]
        db_info = 'We reused an existing connection for %r' % username
    else:
        strangedb_connection = strangedb.connect(username)
        DB_CONNECTIONS[username] = strangedb_connection
        db_info = 'We made a connection for %r' % username

    # Fake up some query
    results = strangedb_connection.query('SELECT * FROM my_table')

    # Fake up an HTTP response
    text = '%s\n%s\n%s\n' % (thread_info, db_info, results)
    return HttpResponse(text, content_type='text/plain')

...which, on the first hit, produces...

Thread ID = 140597557241600
We made a connection for 'bob'
Query results for 'SELECT * FROM my_table'

...and, on the second...

Thread ID = 140597145999104
We reused an existing connection for 'bob'
Query results for 'SELECT * FROM my_table'

Obviously, you'll need to add something to tear down the DB connections when they're no longer required, but it's tough to know the best way to do that without more info about how your app is supposed to work.

Update #1: Regarding I/O multiplexing vs multithreading

I worked with threads twice in my live and each time it was a nightmare. A lot of time was wasted on debugging non reproducible problems. I think an event-driven and a non-blocking I/O architecture might be more solid.

A solution using I/O multiplexing might be better, but would be more complex, and would also require your "strangedb" library to support it, i.e. it would have to be able to handle EAGAIN/EWOULDBLOCK and have the capacity to retry the system call when necessary.

Multithreading in Python is far less dangerous than in most other languages, due to Python's GIL, which, in essence, makes all Python bytecode thread-safe.

In practice, threads only run concurrently when the underlying C code uses the Py_BEGIN_ALLOW_THREADS macro, which, with its counterpart, Py_END_ALLOW_THREADS, are typically wrapped around system calls, and CPU-intensive operations.

The upside of this is that it's almost impossible to have a thread collision in Python code, although the downside is that it won't always make optimal use of multiple CPU cores on a single machine.

The reason I suggest the above solution is that it's relatively simple, and would require minimal code changes, but there may be a better option if you could elaborate more on your "strangedb" library. It seems rather odd to have a DB which requires a separate network connection per concurrent user.

Update #2: Regarding multiprocessing vs multithreading

...the GIL limitations around threading seem to be a bit of an issue. Isn't this one of the reasons why the trend is to use separate processes instead?

That's quite possibly the main reason why Python's multiprocessing module exists, i.e. to provide concurrent execution of Python bytecode across multiple CPU cores, although there is an undocumented ThreadPool class in that module, which uses threads rather than processes.

The "GIL limitations" would certainly be problematic in cases where you really need to exploit every single CPU cycle on every CPU core, e.g. if you were writing a computer game which had to render 60 frames per second in high-definition.

Most web-based services, however, are likely to spend most of their time waiting for something to happen, e.g. network I/O or disk I/O, which Python threads will allow to occur concurrently.

Ultimately, it's trade-off between performance and maintainability, and given that hardware is usually much cheaper than a developer's time, favoring maintainability over performance is usually more cost-effective.

Frankly, the moment you decide to use a virtual machine language, such as Python, instead of a language which compiles into real machine code, such as C, you're already saying that you're prepared to sacrifice some performance in exchange for convenience.

See also The C10K problem for a comparison of techniques for scaling web-based services.

Assuming there are no issues switching from workers to threads (the only thing being if one thread kills the process all other threads will die too, but not really a problem given the other constraints), this sounds like the best solution! — Doddie, Dec 19 '15 at 10:18
Threads ... I fell FUD (Fear, Uncertainty and Doubt) inside me. I worked with threads twice in my live and each time it was a nightmare. A lot of time was wasted on debugging non reproducible problems. I think an event-driven and a non-blocking I/O architecture might be more solid. But this just my feeling, not a solid fact. — guettli, Dec 19 '15 at 12:18
I'm a bit new to the python game but the GIL limitations around threading seem to be a bit of an issue. Isn't this one of the reasons why the trend is to use separate processes instead? — Doddie, Dec 19 '15 at 16:40

score 0 · Answer 4 · edited May 23 '17 at 10:30

0

One simple way to do this would be have another python process manage the pool of persistant connection (one for each user and can timeout when needed). And then the other python process and django can communicate with something fast like zeromq. interprocess communication in python

edited May 23 '17 at 10:30

Community

1
1

answered Dec 19 '15 at 17:19

linusx

306
2
10

Persistent DB Connection in Django/WSGI application

4 Answers4

worker daemon managing the connection

preloading with apache + modwsgi

preloading with uwsgi