Multi-tenant Django applications: altering database connection per request?

Question

I'm looking for working code and ideas from others who have tried to build a multi-tenant Django application using database-level isolation.

Update/Solution: I ended up solving this in a new open source project: see django-db-multitenant

Goal

My goal is to multiplex requests as they come in to a single app server (WSGI frontend like gunicorn), based on the request hostname or request path (for instance, foo.example.com/ sets the Django connection to use database foo, and bar.example.com/ uses database bar).

Precedent

I'm aware of a few existing solutions for multi tenancy in Django:

django-tenant-schemas: This is very close to what I want: you install its middleware at highest precedence, and it sends a SET search_path command to the db. Unfortunately, it is Postgres specific and I am stuck with MySQL.
django-simple-multitenant: The strategy here is to add a "tenant" foreign key to all models, and adjust all application business logic to key off of that. Basically each row becomes indexed by (id, tenant_id) rather than (id). I've tried, and don't like, this approach for a number of reasons: it makes the application more complex, it can lead to hard-to-find bugs, and it provides no database-level isolation.
One {app server, django settings file with appropriate db} per tenant. Aka poor man's multi tenancy (actually rich man's, given the resources it involves). I do not want to spin up a new app server per tenant, and for scalability I want any app server to be able to dispatch requests for any client.

Ideas

My best idea so far is to do something like django-tenant-schemas: in the first middleware, grab django.db.connection and fiddle with the database selection rather than the schema. I haven't quite thought through what this means in terms of pooled/persistent connections.

Another dead end I pursued was tenant-specific table prefixes: Setting aside that I'd need them to be dynamic, even a global table prefix is not easily achieved in Django (see rejected ticket 5000, among others).

Finally, Django multiple database support lets you define multiple named databases, and mux among them based on the instance type and read/write mode. Not helpful since there is no facility to select the db on a per-request basis.

Question

Has anyone managed something similar? If so, how did you implement it?

As originally written, it was not a great fit for Stack Overflow. I've edited out the egregious parts. — George Stocker, May 31 '13 at 01:08
With respect to the "egregious" parts: No prob, tho I am still interested in the "discouraging" advice I asked for, even if anecdotal in nature, since IMO this is a design/architecture question: There are multiple fundamentally different ways to approach multi-tenant application design, so firsthand experience is valuable in gauging design tradeoffs that might not be immediately obvious. Here's a discussion on HN that has helped a bit: https://news.ycombinator.com/item?id=4270003 — mik3y, Jun 04 '13 at 03:53
Such discussions aren't a good fit for Stack Overflow. I edited them out because of that. — George Stocker, Jun 04 '13 at 11:51
It probably isn't specific to your situation, but you really should consider that last option again. I work at a large financial institution, and the shared memory space for an application tier is a huge nogo for us when we evaluate vendors. I understand your concerns for scalability, but if you used something like Puppet or Chef, you could automate these deployments and simply add an entry to your first tier web server. With memory and compute as cheap as they are now, the small amount of extra resources for the extra Django instances would have minimal cost impact. — Titus P, Jun 05 '13 at 16:13
@Threaten: thanks for the comments; it's useful to hear another perspective, I don't think there's a single universally-correct design. I am leaning towards that "option 3" approach for initial deployment, since in addition to the superior isolation you mention, it's the least amount of change compared to a "stock" django app. (In the HN thread I linked above, someone also pointed out that it's also very easy for developers to reason about a customer's live system and request flows when done this way.) — mik3y, Jun 05 '13 at 17:00

Austin Phillips · Answer 1 · 2013-06-06T13:01:20.563

I've done something similar that is closest to point 1, but instead of using middleware to set a default connection, Django database routers are used. This allows application logic to use a number of databases, if required, for each request. It's up to the application logic to choose a suitable database for every query, and this is the big downside of this approach.

With this setup, all databases are listed in settings.DATABASES, including databases which may be shared among customers. Each model that is customer specific is placed in a Django app that has a specific app label.
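To make the settings side concrete, here is a minimal sketch of what such a configuration might look like. The `mysql_db` helper, the alias names, hosts, credentials, and the `myproject.routers` path are all placeholders I made up for illustration; only the `DATABASES` and `DATABASE_ROUTERS` setting names and the MySQL engine path come from Django itself.

```python
# settings.py (fragment). Every name, host, and credential below is a placeholder.


def mysql_db(name, host="127.0.0.1"):
    """Build one DATABASES entry for a MySQL database (hypothetical helper)."""
    return {
        "ENGINE": "django.db.backends.mysql",
        "NAME": name,
        "HOST": host,
        "USER": "app",
        "PASSWORD": "change-me",
    }


DATABASES = {
    # Django requires the 'default' alias to exist; it may be left empty
    # if every model is routed explicitly.
    "default": {},
    # Shared among all customers.
    "shared_db": mysql_db("shared"),
    # One entry per customer; they can live on different hosts.
    "customer_foo": mysql_db("foo"),
    "customer_bar": mysql_db("bar", host="10.0.0.2"),
}

# Dotted path to the router class (hypothetical module path).
DATABASE_ROUTERS = ["myproject.routers.AppLabelRouter"]
```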

E.g., the following class defines a model that exists in all customer databases.

from django.db import models

class MyModel(models.Model):
    # ... field definitions ...
    class Meta:
        app_label = 'customer_records'
        managed = False

A database router is placed in the settings.DATABASE_ROUTERS chain to route database requests by app_label, something like this (not a full example):

class AppLabelRouter(object):
    def get_customer_db(self, model, **hints):
        # Route models belonging to 'myapp' to the 'shared_db' database, irrespective
        # of customer.
        if model._meta.app_label == 'myapp':
            return 'shared_db'
        if model._meta.app_label == 'customer_records':
            # thread_local_data is defined elsewhere and holds the database
            # selected by the caller for the current thread.
            customer_db = thread_local_data.current_customer_db()
            if customer_db is not None:
                return customer_db

            raise Exception("No customer database selected")
        # No opinion: let the next router (or the default database) decide.
        return None

    def db_for_read(self, model, **hints):
        return self.get_customer_db(model, **hints)

    def db_for_write(self, model, **hints):
        return self.get_customer_db(model, **hints)

The special part about this router is the thread_local_data.current_customer_db() call. Before the router is exercised, the caller/application must have set up the current customer db in thread_local_data. A Python context manager can be used for this purpose to push/pop a current customer database.

With all of this configured, the application code then looks something like this, where UseCustomerDatabase is a context manager to push/pop a current customer database name into thread_local_data so that thread_local_data.current_customer_db() will return the correct database name when the router is eventually hit:

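from django.views.generic import DetailView

class MyView(DetailView):
    def get_object(self):
        db_name = determine_customer_db_to_use(self.request)
        with UseCustomerDatabase(db_name):
            # The query must be evaluated while the context manager is active.
            return MyModel.objects.get(pk=1)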
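The thread_local_data object and the UseCustomerDatabase context manager are not shown above. A minimal sketch of how they could be written follows; the `_ThreadLocalData` class and its `db_stack` attribute are names I am assuming for illustration, not part of the original setup.

```python
import threading


class _ThreadLocalData(threading.local):
    """Per-thread stack of selected customer database aliases."""

    def __init__(self):
        # threading.local calls __init__ once in each thread that touches
        # the object, so every thread gets its own empty stack.
        self.db_stack = []

    def current_customer_db(self):
        """Return the innermost selected database, or None if none is set."""
        return self.db_stack[-1] if self.db_stack else None


thread_local_data = _ThreadLocalData()


class UseCustomerDatabase:
    """Push a customer database on entry and pop it again on exit."""

    def __init__(self, db_name):
        self.db_name = db_name

    def __enter__(self):
        thread_local_data.db_stack.append(self.db_name)
        return self.db_name

    def __exit__(self, exc_type, exc_value, traceback):
        # Always restore the previous selection, even if the body raised.
        thread_local_data.db_stack.pop()
        return False  # do not swallow exceptions
```

Using a stack rather than a single value lets the blocks nest: an inner `with UseCustomerDatabase(...)` temporarily overrides the outer one and the outer selection is restored afterwards.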

This is quite a complex setup already. It works, but I'll try to summarize what I see as advantages and disadvantages:

Advantages

Database selection is flexible. Multiple databases, both customer-specific and shared, can be used in a single request.
Database selection is explicit (not sure if this is an advantage or disadvantage). If you try to run a query that hits a customer database but the application hasn't selected one, an exception will occur indicating a programming error.
Using a database router allows different databases to exist on different hosts, rather than relying on a USE db; statement that assumes all databases are accessible through a single connection.

Disadvantages

It's complex to set up, and there are quite a few layers involved to get it functioning.
The need and use of thread local data is obscure.
Views are littered with database selection code. This could be abstracted using class based views to automatically choose a database based on request parameters in the same manner as middleware would choose a default database.
The context manager to choose a database must be wrapped around a queryset in such a manner that the context manager is still active when the query is evaluated.

Suggestions

If you want flexible database access, I'd suggest using Django's database routers. Use middleware or a view mixin that automatically sets up a default database for the request based on request parameters. You might have to resort to thread local data to store that default database so that when the router is hit, it knows which database to route to. This allows Django to use its existing persistent connections to each database (which may reside on different hosts if wanted), and chooses the database to use based on routing set up in the request.
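A rough sketch of that suggestion, assuming a fixed host-to-alias table: the middleware stores the alias in thread local data and the router reads it back. `TENANT_DATABASES`, `TenantMiddleware`, `TenantRouter`, and `get_current_tenant_db` are hypothetical names; the aliases would have to exist in settings.DATABASES.

```python
import threading

# Hypothetical mapping from request host to an alias in settings.DATABASES.
TENANT_DATABASES = {
    "foo.example.com": "customer_foo",
    "bar.example.com": "customer_bar",
}

_state = threading.local()


def get_current_tenant_db():
    """Return the alias chosen for the current request, or None."""
    return getattr(_state, "db_alias", None)


class TenantMiddleware:
    """Select the tenant database for the duration of one request."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        host = request.get_host().split(":", 1)[0].lower()  # drop any port
        if host not in TENANT_DATABASES:
            # A real application would more likely return a 404 here.
            raise LookupError("no tenant configured for host %r" % host)
        _state.db_alias = TENANT_DATABASES[host]
        try:
            return self.get_response(request)
        finally:
            # Worker threads are reused, so never leak one tenant's
            # selection into the next request.
            _state.db_alias = None


class TenantRouter:
    """Route every read and write to the database chosen by the middleware."""

    def db_for_read(self, model, **hints):
        return get_current_tenant_db()

    def db_for_write(self, model, **hints):
        return get_current_tenant_db()
```

The same caveat as with the context manager applies: a queryset that is only evaluated after the middleware has reset the alias (for example inside a streaming response) would no longer be routed to the tenant database.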

This approach also has the advantage that the database for a query can be overridden if needed by using the QuerySet using() function to select a database other than the default.

Thanks for the insightful answer! I'm marking this the answer now, realizing that there's no single "correct" architecture; you give a good overview of this approach. — mik3y, Jun 06 '13 at 21:18
Here's what I ended up implementing: https://github.com/mik3y/django-db-multitenant — mik3y, Jul 11 '13 at 17:57

score 5 · Accepted Answer · answered Jul 11 '13 at 17:58

For the record, I chose to implement a variation of my first idea: issue a USE <dbname> in an early request middleware. I also set the CACHE prefix the same way.
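A minimal sketch of the idea, not necessarily how django-db-multitenant implements it: the tenant name is derived from the request host, validated, and sent as a USE statement on the connection before the view runs. `tenant_db_for_host`, `UseTenantDatabaseMiddleware`, the `example.com` base domain, and the injectable `connection` argument are assumptions made for illustration.

```python
import re

# The name is interpolated into SQL, so only allow characters that are
# safe inside a backtick-quoted MySQL identifier.
_SAFE_DB_NAME = re.compile(r"[A-Za-z0-9_]+")


def tenant_db_for_host(host, base_domain="example.com"):
    """Map 'foo.example.com[:port]' to the database name 'foo'."""
    hostname = host.split(":", 1)[0].lower()
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        raise ValueError("unexpected host: %r" % host)
    tenant = hostname[: -len(suffix)]
    if not _SAFE_DB_NAME.fullmatch(tenant):
        raise ValueError("invalid tenant name: %r" % tenant)
    return tenant


class UseTenantDatabaseMiddleware:
    """Issue USE <dbname> on the connection before the view runs."""

    def __init__(self, get_response, connection=None):
        self.get_response = get_response
        # Injectable so the class can be exercised without a database;
        # when omitted, Django's default connection is used.
        self._connection = connection

    def __call__(self, request):
        db_name = tenant_db_for_host(request.get_host())
        connection = self._connection
        if connection is None:
            from django.db import connection
        with connection.cursor() as cursor:
            cursor.execute("USE `%s`" % db_name)
        return self.get_response(request)
```

Because the statement is issued on every request, a persistent connection that last served another tenant is switched before any query from the current request runs. This relies on the middleware sitting ahead of anything that touches the database.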

I'm using it on a small production site, looking up the tenant name from a Redis database based on the request host. So far, I'm quite happy with the results.

I've turned it into a (hopefully reusable) GitHub project here: https://github.com/mik3y/django-db-multitenant

score 2 · Answer 3 · answered May 31 '13 at 01:02

You could create a simple middleware of your own that determines the database name from your sub-domain or whatever and then executes a USE statement on the database cursor for each request. Looking at the django-tenant-schemas code, that is essentially what it is doing. It is sub-classing psycopg2 and issuing the Postgres equivalent of USE, "SET search_path XXX". You could create a model to manage and create your tenants too, but then you would be re-writing much of django-tenant-schemas.

There should be no performance or resource penalty in MySQL to switching the schema (db name). It is just setting a session parameter for the connection.

Victor Bruno

Agreed, although that appears to be what the OP has already considered as his first idea. He noted "*I haven't quite thought through what this means in terms of pooled/persistent connections*" - are you able to illuminate? – eggyal May 31 '13 at 13:46
Yes, that's basically what I'm describing in the first paragraph under "Ideas". It's probably the route I'll go, at least as a first experiment. I'd love to know if anyone has done it in practice; it seems we both agree it's not vastly different from the postgres schemas approach. – mik3y Jun 04 '13 at 03:56
Also: "There should be no performance or resource penalty" -- well, it's not free, but maybe it's the least expensive choice; we have to execute that `USE` after all. When combined with persistent db connections, I'll need some sort of LRU cache of connections, each bound to a specific tenant on setup. This is the part I hand waved through, and is an area in which I'm curious if there's precedent. – mik3y Jun 04 '13 at 04:06
