30

I have a Django project based on multiple PostgreSQL servers.

I want users to be sharded across those database servers using the same sharding logic used by Instagram:

User ID => logical shard ID => physical shard ID => database server => schema => user table

  • The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
  • The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
  • The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
  • Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
  • Finally, the user table in the appropriate schema is queried.

How can this be achieved as simply as possible in Django ?

Ideally, I would love to be able to write Django code such as:

Fetching an instance

# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)

Fetching related objects

# this gets the user's posted articles, located in the same logical shard:
articles = user.articles

Creating an instance

# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()

Searching users by name

# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
#   relevant database servers.
users = User.objects.filter(title = "King")

To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.

Django provides support for automatic database routing which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters, so it does not know what the user ID is, it just knows that the code is trying to read the User model.

I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.

In short: how can I shard data in Django, as simply as possible?

Thanks a lot for your kind help.

Note

There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).

Community
  • 1
  • 1
MiniQuark
  • 46,633
  • 36
  • 147
  • 183
  • 2
    This is going to require heavy modifications to the ORM. – Ignacio Vazquez-Abrams Aug 02 '12 at 17:09
  • I am not familiar with Django's database routing functionality, but have you seen the `hints` parameter in `db_for_read(model, **hints)` ([Link](https://docs.djangoproject.com/en/dev/topics/db/multi-db/#db_for_read))? See also the [description for this](https://docs.djangoproject.com/en/dev/topics/db/multi-db/#hints) – cyroxx Aug 02 '12 at 17:12
  • Yes, I checked it out, but it only contains an "instance" parameter when an instance already exists in memory, so this rules out things like User.objects.get(pk = user_id): the router does not have access to the pk. – MiniQuark Aug 02 '12 at 20:19
  • FYI, mysql databases are also called mysql schemas. No terminology confusion here. MySQL even supports the CREATE SCHEMA command. – fabspro May 29 '13 at 05:11

3 Answers3

9

Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.

Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Where as the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.

In addition, they released some code as part of a sample application demonstrating how they shard.

Streeter
  • 556
  • 4
  • 22
2

You really don't want to be in the position of asking this question. If you are sharding by user id then you probably don't want to search by name.

If you are sharding your database then it's not going to be invisible to your application and will probably end up requiring schema alterations.

You might find SkyTools useful - read up on PL/Proxy. It's how Skype shard their databases.

Yushin Washio
  • 675
  • 8
  • 12
Richard Huxton
  • 21,516
  • 3
  • 39
  • 51
  • Thanks for the link, Richard. I agree that this is a complex problem, but lots of companies have had to solve it (Facebook, Google, Skype, Twitter, etc.): you can search users by name, or parts of a name, so it's complex even if users are sharded by name. Is it some kind of Map/Reduce algorithm? But how can that scale since every search would query all databases? I would love to know how they do it! But solving the other issues would be great. – MiniQuark Aug 02 '12 at 21:23
  • 1
    You don't solve the "query all databases" problem. You can't. You are also adding a host of functional and performance dependencies on your network infrastructure. I've not even started on issues with consistency since you seem to want to query backup nodes too. If you can solve this problem just by spending 10,000 on more RAM and disks I'd do that - it'll be cheaper than your time. – Richard Huxton Aug 03 '12 at 05:33
  • I agree, as I mentioned in my question: "I am well aware that sharding should probably be used only as a last resort optimization". So more RAM, more disks, yes. But let's forget about the "Searching users by name" part for now. Do you have answers or clues for the other 3 examples? – MiniQuark Aug 03 '12 at 08:15
  • Start by reading up about pl/proxy and skytools. That's about the simplest setup I've heard of, but it involves routing all the queries through functions. The idea that you'll be able to do this without changing almost all of your queries is unrealistic I'm afraid. I'm not aware of any out-of-the-box ORM that can handle this sort of thing either. Unless this is just a learning exercise, I really don't think it's a good idea from where you are starting. Skype, Twitter et al don't do this because it's cool, they do it because they can't buy a big enough box no matter how much money they have. – Richard Huxton Aug 03 '12 at 09:34
  • Ok, thanks for your advice. This question is for a real project that is only just starting. Version 1.0 will be implemented without any sharding whatsoever, and it's likely that sharding will never be needed. But investors want to make sure that every option has been thoroughly considered, even the possibility of a great success, where sharding would be required. So I'm trying to find answers, perhaps bring up a little demo as well, but that's it for now. Plus, I admit that I find this subject particularly interesting. :-) – MiniQuark Aug 03 '12 at 10:26
  • Either (1) build your whole system around it from the start, or (2) when you need to purchase a really big server start replanning your setup at the same time. The more relationships between different tables the harder it will be. – Richard Huxton Aug 03 '12 at 10:47
  • The application will allow users to send messages to one-another, and to search full-text in sent messages as well as in received messages (much like in Gmail). If I ever need to shard users, where would I store the messages? Should they be attached only to the sender, or should they also be attached to the recipients? If a message is sent to 9 recipients, there would be 10 copies stored. Even though its a waste of disk space, I think it's probably best to duplicate data (disk space is cheap compared to solving the full-text search issues). – MiniQuark Aug 03 '12 at 11:08
  • Well, that's the problem with sharding, isn't it? The most efficient solution will depend on the ratio of users to messages and exactly how searches operate. Oh, and whether you have any other options than send/receive/search (browse, sort, etc). Without a specific usage pattern in mind you'll not be able to tell. – Richard Huxton Aug 03 '12 at 13:03
0

it is better to use professional sharding middleware, for example: Apache ShardingSphere.

The project contains 2 productions, ShardingSphere-JDBC for java driver, and ShardingSphere-Proxy for all programing languages. It can support python and Django as well.