
Say I have a website that lets a user download their feed quickly. A user can be subscribed to many different pages, and the server must return a feed containing only the N most recent posts across all of the pages the user is subscribed to. Originally, when a user queried the server for a feed, the algorithm was as follows (a rough sketch in code appears after the list):

  1. look at all of the pages the user is subscribed to
  2. get the N most recent posts from each page
  3. sort all of the posts
  4. return the N most recent posts to the user as their feed
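
In rough Django-ORM-style code (model and field names are purely illustrative), that looked something like this:

def build_feed(user, n=20):
    candidate_posts = []
    # 1. Look at all of the pages the user is subscribed to.
    for page in user.followed_pages.all():
        # 2. Get the N most recent posts from each page.
        candidate_posts.extend(page.posts.order_by('-time_published')[:n])
    # 3. Sort all of the collected posts by recency.
    candidate_posts.sort(key=lambda post: post.time_published, reverse=True)
    # 4. Return the N most recent posts as the feed.
    return candidate_posts[:n]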

As it turns out, doing this EVERY TIME a user tried to refresh their feed was really slow. So I changed the database to have a feedpost table, which simply has a foreign key to a user and a foreign key to a post. Every time a page makes a new post, it creates a feedpost row for each of its followers. That way, when a user wants their feed, it already exists and does not have to be built at retrieval time.
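
In rough Django terms (model and field names here are only illustrative), the new table and the fan-out on write look something like this:

from django.db import models

class FeedPost(models.Model):
    # One row per (user, post) pair that should appear in that user's feed.
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE, related_name='feed_posts')
    post = models.ForeignKey('Post', on_delete=models.CASCADE)

def fan_out(post):
    # Called whenever a page publishes a new post: one FeedPost row per follower.
    FeedPost.objects.bulk_create(
        [FeedPost(user=follower, post=post) for follower in post.page.followers.all()]
    )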

The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.

Please help! How do companies such as Facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?

Chisx

2 Answers


It's not that the original schema itself is inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database the way relational databases should be accessed.

In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.

As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.

Let me try to illustrate with a simple piece of code. Assume you have the following models:

from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=47)
    # Users who follow this page.
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    # The page this post belongs to; on_delete is required on current Django versions.
    page = models.ForeignKey(Page, related_name='posts', on_delete=models.CASCADE)
    content = models.TextField()
    # Set automatically when the post is first created.
    time_published = models.DateTimeField(auto_now_add=True)

You could, for example, get the 20 most recent posts from the pages followed by the currently logged-in user with the following single line of code:

latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]

This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.

koniiiik

Caching is the solution here. You will have to reduce the number of database reads, which are much slower than cache reads.

You can use something like Redis to cache the posts. Here is a great answer for a better understanding:

Is Redis just a cache

Each page can be assigned a key, and you can pull all of the posts for that page under that key.

You don't need to cache everything; just cache the most recent M posts, where M >> N and M is large enough to absorb most of the database calls. If a user requests posts beyond the latest M, those can be fetched directly from the DB.

Now, when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or you can put those in the cache as well), and then just pull the required number of posts from the cache.
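
As a rough sketch with redis-py, assuming each page's recent posts are stored as JSON in a Redis list (the key scheme and the values of M and N are just examples):

import json
import redis

r = redis.Redis()
M = 500   # posts cached per page, with M >> N
N = 20    # posts returned in a feed

def cached_page_posts(page_id):
    # Each page has its own key; only its M most recent posts are kept, newest first.
    return [json.loads(raw) for raw in r.lrange('page:%d:posts' % page_id, 0, M - 1)]

def build_feed(subscribed_page_ids):
    posts = []
    for page_id in subscribed_page_ids:
        posts.extend(cached_page_posts(page_id))
    # Merge the cached posts and keep only the N most recent across all pages.
    posts.sort(key=lambda post: post['time_published'], reverse=True)
    return posts[:N]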

The problem here is keeping the cache up to date.

For that you can use something like Django signals: whenever a new post is added, add it to the cache as well from a signal handler.
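
For example, a post_save handler could push each new post onto the page's list and trim it (the Post model is the one from the other answer; the app name and key scheme are hypothetical):

import json
import redis
from django.db.models.signals import post_save
from django.dispatch import receiver
from myapp.models import Post  # hypothetical app containing the Post model

r = redis.Redis()
M = 500  # posts kept in cache per page

@receiver(post_save, sender=Post)
def push_post_to_cache(sender, instance, created, **kwargs):
    if not created:
        return
    key = 'page:%d:posts' % instance.page_id
    r.lpush(key, json.dumps({
        'id': instance.pk,
        'title': instance.title,
        'time_published': instance.time_published.isoformat(),
    }))
    # Drop anything beyond the M most recent posts for this page.
    r.ltrim(key, 0, M - 1)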

So for each DB write you will have to write to the cache as well. But then you will not have to read from the DB, and since Redis is an in-memory datastore, it is very fast compared to standard relational databases.

Edit: Here are a few more articles that can help with understanding:

Does Stack Exchange use caching and if so, how

How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

ofnowhere