
Assume I have a model like this:

    class Post(models.Model):
        name = models.CharField(max_length=25, unique=True)

    class Picture(models.Model):
        post = models.ForeignKey(to=Post, on_delete=models.CASCADE)
        image = models.ImageField()

Now assume I do a query something like this:

    p = Post.objects.get(name=foo)
    images = p.picture_set.all()

Now the first query obviously searches through all the posts to get the one that has name foo.
But I would like to know about the second one. Does it search through the whole Picture table in the database to find all the pictures that have post=p, or is that information already available when I get p in the first query?
Because if it's the former, then I am worried about scalability issues.

fuxen_p
1 Answer


But I would like to know about the second one. Does it search through the whole Picture table in the database to find all the pictures that have post=p, or is that information already available when I get p in the first query?

Short answer: by default a ForeignKey adds an index, which makes the retrieval quite fast (logarithmic in the number of rows to find the matches, plus linear in the number of matching reverse records to retrieve them).

That depends on whether the database constructs an index on the ForeignKey. By default, Django will construct one. This means that the database not only stores the rows of the table, but also maintains a data structure that allows fast lookup of all the rows that have a specific value.

The implementation of the index is database dependent. MySQL, for example, uses a BTREE by default, which means that looking up a value takes approximately O(log n) to locate the matching rows, and O(k), with k the number of items with that foreign key, to retrieve them all. Other index structures exist, such as hash tables, which allow (slightly) faster point lookups, although a hash table, for example, is not efficient for retrieving all elements with a ForeignKey less than a given value.
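
If you want to see this in action, you can print the SQL that Django generates for the reverse lookup; a minimal sketch (the `app_picture` table name and the literal id are placeholders, and the exact SQL depends on the backend):

    p = Post.objects.get(name="foo")
    # the reverse manager is just a filter on the indexed post_id column
    qs = p.picture_set.all()
    print(qs.query)
    # roughly: SELECT ... FROM app_picture WHERE app_picture.post_id = 1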

You can also add an index on other columns, for example:

    class Post(models.Model):
        name = models.CharField(max_length=25, db_index=True, unique=True)

So now retrieving all Post objects with a given name will run faster as well.
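
If you prefer to declare indexes explicitly, you can also list them in the model's Meta; a sketch of an equivalent declaration, assuming Django 1.11 or newer:

    from django.db import models

    class Post(models.Model):
        name = models.CharField(max_length=25, unique=True)

        class Meta:
            indexes = [
                models.Index(fields=["name"]),
            ]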

Using indices is of course not "free": it means that each time you insert or remove a record, the index needs to be updated as well (typically this also takes O(log n)). If you update a record by changing the value of the foreign key, then that index needs to be updated too. So indices provide a significant speedup, but one should aim to only put indexes on columns on which one performs lookups frequently, since otherwise the cost of "maintaining" the index can be larger than the gain of speeding up the lookup process.
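
To check whether a particular query actually uses an index, you can ask the database for its query plan; a sketch assuming Django 2.1 or newer (the output format depends on the database backend):

    p = Post.objects.get(name="foo")
    # the plan typically shows an index lookup on the picture table's post_id column
    print(Picture.objects.filter(post=p).explain())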

Willem Van Onsem
  • Is there a faster method to accomplish this? If my app has some 100 posts and each post has some 5 images, that makes it search through 500 images to get the images relevant to the post I am querying. Even log(n) seems too much. Is there a better schema I could construct so that I get images for a post instantaneously along with post data when I query it ? – fuxen_p Oct 28 '18 at 11:50
  • @fuxen_p: `log` is quite scalable. Here it means it will approximately take 6+5 steps, versus the 100 steps it would cost otherwise. Furthermore if you have like 1M records, it would take 19+5 steps, so if your database scales 10000 times larger, it would approx. take only 2.18 times the time it takes for a db of 100 records. Especially since the focus is frequently more on the *size* of the index: if the index itself is large, then that means it would take fetches from the disk. – Willem Van Onsem Oct 28 '18 at 11:53
  • Usually a database manager aims to store (large parts) of the index in *memory* such that knowing *where* the objects are located is very fast, and one only loses significant time with looking up the records from the disk. After all, memory is way faster (for a HDD typically 200 times faster, for an SSD 3 times faster). So that means that fetching records is typically more the bottleneck than accessing the index. – Willem Van Onsem Oct 28 '18 at 11:55