4

In my app, I can describe an Entity using different Protocols, with each Protocol being a collection of various Traits, and each Trait allows two or more Classes. So, a Description is a collection of Expressions. E.g., I want to describe an entity "John" with the Protocol "X" that comprises the following two Traits and Classes:

Protocol ABC

Trait 1: Height

Available Classes: a. Short b. Medium c. Tall

Trait 2: Weight

Available Classes: a. Light b. Medium c. Heavy

John's Description: Expression 1: c. Tall, Expression 2: b. Medium

My model specification (barebone essentials for simplicity):

class Protocol(models.Model):
    """
    A Protocol is a collection of Traits
    """
    name = models.CharField()

class Trait(models.Model):
    """
    Stores the Traits. Each Trait can have multiple Classes
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="The reference protocol of the trait",
    )

class Class(models.Model):
    """
    Stores the different Classes related to a Trait.
    """

    name = models.CharField()
    trait = models.ForeignKey(Trait)

class Description(models.Model):
    """
    Stores the Descriptions. A description is a collection of Expressions.
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="reference to the protocol used to make the description;\
            this will define which Traits will be available",
    )
    entity = models.ForeignKey(
        Entity,
        help_text="the Entity to which the description refers to",
    )

class Expression(models.Model):
    """
    Stores the expressions of entities related to a specific
    Description. It refers to one particular Class (which is
    then associated with a specific Trait)
    """

    class = models.ForeignKey(Class)
    description = models.ForeignKey(Description)

Following the previous example, let's say I want to find all the Entities that are medium or tall (Trait 1) and heavy (Trait 2). The query I'm now using is the following:

# This is the filter returned by the HTML form, which list
# all the available Classes for each Trait of the selected Protocol
filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()

for filter in filters:
  queryset = queryset.filter(expression_set__class__in=filter["class"])

The problem is that the query is slow (I have ATM ~1000 Descriptions, described with a Protocol of 40 Traits, each Trait having 2 to 5 Classes). It takes about two seconds to return the results even when filtering by only 5-6 Expressions. I tried using prefetch_related("expression_set") or prefetch_related("expression_set__class") but with no significant improvement.

The question is: can you suggest a way to improve the performance, or this is simply the reality of searching through so many tables?

Thank you very much for your time.


EDIT: The following is the query generated by the Manager when, e.g., eight filters (see previous code snippet) are applied.

SELECT "describe_description"."id",
       "describe_description"."name",
       "describe_description"."protocol_id",
  FROM "describe_description"
 INNER JOIN "describe_expression"
    ON ("describe_description"."id" = "describe_expression"."description_id")
 INNER JOIN "describe_expression" T4
    ON ("describe_description"."id" = T4."description_id")
 INNER JOIN "describe_expression" T6
    ON ("describe_description"."id" = T6."description_id")
 INNER JOIN "describe_expression" T8
    ON ("describe_description"."id" = T8."description_id")
 INNER JOIN "describe_expression" T10
    ON ("describe_description"."id" = T10."description_id")
 INNER JOIN "describe_expression" T12
    ON ("describe_description"."id" = T12."description_id")
 INNER JOIN "describe_expression" T14
    ON ("describe_description"."id" = T14."description_id")
 INNER JOIN "describe_expression" T16
    ON ("describe_description"."id" = T16."description_id")
 INNER JOIN "describe_expression" T18
    ON ("describe_description"."id" = T18."description_id")
 WHERE ("describe_expression"."class_id" IN (732) AND T4."class_id" IN (740) AND T6."class_id" IN (760) AND T8."class_id" IN (783) AND T10."class_id" IN (794) AND T12."class_id" IN (851) AND T14."class_id" IN (857) AND T16."class_id" IN (860) AND T18."class_id" IN (874))
kofm
  • 125
  • 11
  • 2
    Can you also share how you display the results? – Brian Destura Aug 14 '22 at 10:57
  • It's better to use multiple functions (`def func():`) instead of multiple classes, Especially if you are calling these multiple times with a large dataset. Check [this question](https://stackoverflow.com/questions/8492624/how-much-slower-python-classes-are-compared-to-their-equivalent-functions) out – DialFrost Aug 14 '22 at 11:27
  • @BrianDestura I simply return a list of the matching Description, only the "name" and "protocol" property – kofm Aug 14 '22 at 11:55
  • 1
    Try to use `.select_related("protocol")` in your description queryset. Otherwise it might be worth using something like django debug toolbar to check what queries are being run, and which one takes the most time – Brian Destura Aug 15 '22 at 04:40
  • Just for info from which class you are filtering the data ? – lord stock Aug 16 '22 at 14:17
  • I use the last code snippet in the post (Description class) – kofm Aug 17 '22 at 10:40

4 Answers4

2

First you should avoid multiple joins by aggregating desired filters upfront:

filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()
class_filter = []
for filter_entry in filters:
    class_filter.append(filter_entry["class"])
queryset = queryset.filter(expression_set__class__in=class_filter)

Second problem is scanning for text values. Use db_index=True on your Class.name field.

EDIT: There is a difference in chaining the filters on the same table and using Q objects. It does not act like AND on the same object. It seems counterintuitive as in the sql you can see AND but that is AND on multiple joins, where each join effectively duplicates the descriptions (that is why it get's slow). Best explained in the Django docs or this article.

Quick excerpt of the docs:

To select all blogs containing at least one entry from 2008 having “Lennon” in its headline (the same entry satisfying both conditions), we would write:

Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)

Otherwise, to perform a more permissive query selecting any blogs with merely some entry with “Lennon” in its headline and some entry from 2008, we would write:

Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

EDIT 2: Schematic example of above from this answer:

enter image description here

Blog.objects.filter(entry__headline_contains='Lennon', 
entry__pub_date__year=2008)

filters only Blog 1

Blog.objects.filter(entry__headline_contains='Lennon').filter(
entry__pub_date__year=2008)  

filters blog 1 and 2

preator
  • 984
  • 5
  • 6
  • This is not the same search. In the example you provided the Description instances returned will be those that have a related Expression of ID 2, 3, or 6. What I want is all the Description instances that have related Expression of 2 or 3 AND 6. Each single filter must be true. Also, I don't scan for text values. I'm using the IDs which are primary keys – kofm Aug 18 '22 at 14:19
  • please check the edit, in case you want it like 'Expression of 2 or 3 AND 6' use the Q & Q method, check the data for consistency why there are no results if you are expecting them. – preator Aug 19 '22 at 09:58
  • Well in my case I would like to filter both blog 1 and 2, so my only option is to chaining filter (with the corresponding performance issues?) – kofm Aug 22 '22 at 07:19
  • There is no silver bullet but you can try few approaches and see which gets better results. First I would try to run query for expression with classes (btw class is reserved keyword) 2 or 3 and then for expressions with class 6 returning description ids. Then intersect these ids using python sets. It will produce more but simpler queries. – preator Aug 22 '22 at 09:35
  • Also you can easily filter for descriptions with 2 or 3 or 6 which will narrow down the results, annotate the class ids using StringAgg and some delimiter and process the result in one forloop scan, picking just those that match 2 or 3 AND 6. – preator Aug 22 '22 at 09:42
  • I ended up performing separate queries for each filter, retrieving only IDs and then intersecting all the retrieved IDs. Performance improved by 10x. Thanks for your support. IDK if I should post the code. What do you suggest? – kofm Aug 23 '22 at 09:21
  • sure, it can help someone in the future – preator Aug 23 '22 at 09:32
1

It's slightly better to use multiple functions instead I think. It runs at the same speed as using classes, if not even faster. Check this question out. After you start using functions, you can try using @cached_property(func, name=None):

It’s common to have to call a class instance’s method more than once. If that function is expensive, then doing so can be wasteful.

Using the cached_property decorator saves the value returned by a property; the next time the function is called on that instance, it will return the saved value rather than re-computing it. Note that this only works on methods that take self as their only argument and that it changes the method to a property.

Consider a typical case, where a view might need to call a model’s method to perform some computation, before placing the model instance into the context, where the template might invoke the method once more:

# the model
class Person(models.Model):

    def friends(self):
        # expensive computation
        ...
        return friends

# in the view:
if person.friends():
    ...

And in the template you would have:

{% for friend in person.friends %}

Here, friends() will be called twice. Since the instance person in the view and the template are the same, decorating the friends() method with @cached_property can avoid that:

from django.utils.functional import cached_property

class Person(models.Model):

    @cached_property
    def friends(self):
        ...

Relevant questions and sources:

DialFrost
  • 1,610
  • 1
  • 8
  • 28
  • Well but the classes' data here is not computed, it's all data stored in database. Maybe I do not understand something? – kofm Aug 14 '22 at 11:57
  • I'm a bit confused here too :P, I was suggesting why can't you create a single class with all the functions inside! It will still work i'm pretty sure @kofm – DialFrost Aug 14 '22 at 11:59
  • There isn't any function inside the classes. These are Django ORM representation of database structure – kofm Aug 14 '22 at 12:07
1

To understand more about the queries, one can use Django Debug Toolbar. That's helpful to use because it's hard to know how we can improve if we're not able to measure the current state (as it appears to be the case).

Django has a page specific for database access optimization. In it, one can read, for instance, that QuerySets are lazy.

Since OP explored Django ORM and didn't get really good results from it, to improve performance OP may try to use raw SQL queries. In other words, write one's own SQL to retrieve the data. According to the documentation

Django gives you two ways of performing raw SQL queries: you can use Manager.raw() to perform raw queries and return model instances, or you can avoid the model layer entirely and execute custom SQL directly.

Another way to speed up the query can be to add indexes... The absence of such can slow down one's query.

Additionally, OP should consider using some cache, like MemCached. According to Alex Xu,

A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly. (...) The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, ability to reduce database workloads, and the ability to scale the cache tier independently.

Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
  • Thanks for the suggestions. I used Django Debug Toolbar to identify that the most time-consuming query was the one illustrated in my post. If it could be of any help, I can post the resulting query – kofm Aug 17 '22 at 11:12
  • Also (and correct me if I'm wrong), I think cache doesn't make much sense here since we're talking about a *search* view that should be able to filter data throughout the entire database according to user input. – kofm Aug 17 '22 at 12:56
  • @kofm Django offers different levels of cache granularity. Using cache can help you indeed improve performance in the queries to come. If the data exists in the cache tier, then return from there; else, save the data in cache (there's different caching strategies too). – Gonçalo Peres Aug 17 '22 at 19:56
0

Different databases have different performance quirks, so without knowing what you're running against it's a bit of a shot in the dark, but have you tried Q-objects?

from django.db.models import QuerySet

q = Q()
for filter in filters:
    q |= Q(expression_set__class_in=filter['class'])
queryset.filter(q)

or simply pre-calculating the filter value (since you're already doing an in-query):

filtr = []
for f in filters:
    filtr += filter['class']
queryset.filter(expression_set__class_in=filtr)

or

queryset = Description.objects.filter(
    expression__class__in=Class.objects.filter(pk__in=filtr)  
)

(filtr as constructed above)

I'm assuming you've examined the query-analyzer output to make sure you're not missing any indexes etc...

Caching is probably not a good idea (since this is close to ad-hoc queries almost everything will be a cache miss), but 1000 * 40 * 5 probably fits in memory, so it might be a solution to read it all into memory if this is a time-critical piece that runs frequently.

thebjorn
  • 26,297
  • 11
  • 96
  • 138
  • I'm using PSQL. Regarding the first solution you proposed, it should reproduce the exact query by changing the OR operator I with AND operator. However I can't get it to work for some reason I do not understand. While this work: ```python Description.objects.filter(expressions__class__in=[731, 732]).filter(expressions__class__in=[734, ]) ``` This doesn't, i.e. it does not return any result (empty queryset) ```python Description.objects.filter(Q(expressions__class__in=[731, 732]) & Q(expressions__class__in=[734, ])) ``~ – kofm Aug 18 '22 at 14:54
  • I think caching is not an option since it may be enough for now, but the number of Descriptions is destined to grow – kofm Aug 18 '22 at 14:59