
This problem relates to a gaming arcade parlor where people come in and play games. Each time a person plays, a new entry is created in the database. My model looks like this:

class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    points = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)

My view is like this:

from datetime import datetime
from django.db.models import Count, Q

today = datetime.now().date()

# Query the DB for gaming_machine objects whose points are 192 or 100,
# together with a separate count for each (machine_no, points) pair

gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 100, 8)]>
  1. Can I change the returned queryset format to something like this: {(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 100): 8} — that is, a dictionary keyed by a (machine_no, points) tuple, with the count of such rows as the value?
  2. I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds, because Django querysets are lazy.
Ash Singh
    Querysets are lazy in the sense that they postpone the query, but once they have to query, they fetch all rows immediately. – Willem Van Onsem Jun 13 '18 at 10:22
  • It looks to me that the counting is probably the problem. Perhaps you should try to solve the problem at the database side, by making a table that stores the aggregate, and periodically updates it. – Willem Van Onsem Jun 13 '18 at 10:24
    Furthermore you probably can boost your query by rewriting the `created__startswith` part, etc. – Willem Van Onsem Jun 13 '18 at 10:25
    You could of course write your own queryset subclass with a dedicated method to get the expected format but that wouldn't change anything wrt/ performances. And your problem has nothing to do with querysets being lazy - "lazy" means they don't issue any db query until you first try to get results, not that they take a nap after each row they fetch... – bruno desthuilliers Jun 13 '18 at 11:02
  • @WillemVanOnsem: The counting problem was already solved in the query I wrote; it's just that I want a different structure for the returned queryset. I was wondering why, when I run this query's results in a loop and just print the tuples, it is really fast, but when I try to restructure them inside the loop instead of just printing, it becomes slow. – Ash Singh Jun 14 '18 at 06:28
  • @brunodesthuilliers: Can you please read my comment above and help me understand the behavior of the returned queryset results? Also, since you mentioned it is possible to restructure the outcome of a queryset, can you refer me to a useful link? – Ash Singh Jun 14 '18 at 06:31
  • @jencko how querysets work is no black magic, and you already have the best possible documentation available: the source code. wrt/ "restructure the outcome of queryset", I'm afraid you didn't get my point, what I meant was that it wouldn't improve performances anyway. And finally I strongly suspect a XY problem, so I'd _really_ like to know _why_ you want to build this dict (hint: dict are useful for lookups, but you already have a SQL database and looking up data is something they surely know how to do). – bruno desthuilliers Jun 14 '18 at 07:57
  • @jencko: that's close to impossible, since Django needs to deserialize the answer of the database into tuples. The query itself can take long since, for example, by default there is no index created for `points`. So I think the most gain can be had by optimizing the query. Yes, the dictionary comprehension will take some time, but that is comparable to the deserializing process, so it will not significantly slow down the result. – Willem Van Onsem Jun 14 '18 at 10:45
  • @WillemVanOnsem: Thanks, can you suggest something for boosting the performance of this query? How can I write it better? I tried replacing Q(points=100) | Q(points=192) with points__in=[192, 100], but it is still slow. Is there anything particularly inefficient that I am doing in this query? – Ash Singh Jun 15 '18 at 02:55
  • @jencko: what if you add an index on `points`? – Willem Van Onsem Jun 15 '18 at 03:47
  • @WillemVanOnsem: an index on `points`? – Ash Singh Jun 19 '18 at 08:53
  • @jencko: yes, a database can make a `MUL` index to perform fast filtering: https://docs.djangoproject.com/en/2.0/topics/db/optimization/ – Willem Van Onsem Jun 19 '18 at 09:12
  • In case the number of *distinct* points is large, building an index can boost performance, since then the database almost instantly knows what rows have a given amount of points (compare it with a dictionary lookup over linear search). – Willem Van Onsem Jun 19 '18 at 09:13

1 Answer


Django's lazy queries...

but I can't do that, as it takes around 1.4 seconds, because Django querysets are lazy.

The laziness of Django's querysets actually has close to no impact on performance. They are lazy in the sense that they postpone querying the database until you need the result (for example, when you start iterating over it). But at that point they fetch all the rows at once. So there is no per-row overhead of repeatedly fetching the next row: all rows are fetched in one go, and then Python iterates over them quite fast.

The laziness is thus not on a row-by-row basis: the queryset does not advance a cursor each time you want to fetch the next row. Communication with the database is therefore quite limited.
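To make this concrete, here is a minimal sketch (not part of the original answer) of where the database is actually hit, using the `gaming_machine` model from the question:

# building the queryset does *not* hit the database yet
qs = gaming_machine.objects.filter(points__in=[100, 192])

# chaining only refines the (still lazy) queryset; still no query
qs = qs.values_list('machine_no', 'points')

# the query runs here, once, and all matching rows are fetched in a
# single round trip; iterating afterwards is plain, fast Python
rows = list(qs)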

... and why it does not matter (performance-wise)

Unless the number of rows is huge (50,000 or more), the conversion to a dictionary should also happen rather fast. So I suspect that the overhead is mostly due to the query itself. Django has to "deserialize" the elements anyway (turn the response into tuples), so although the dictionary comprehension adds some extra work, it is usually small compared to the work that is already done without it. Typically one pushes work into the query when doing so reduces the amount of data transferred to Python.

For example, by performing the count in the database, the database returns one integer per group instead of several rows; by filtering, we reduce the number of rows as well (since typically not all rows match a given criterion). Furthermore, databases have fast lookup mechanisms that speed up WHEREs, GROUP BYs, ORDER BYs, etc. But merely post-processing the result stream into a differently shaped object usually takes the database the same order of magnitude of time as it takes Python.
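If you are curious what work is pushed to the database, you can print the SQL Django generates for the query from the question. A sketch; the exact SQL and the table name (`appname_gaming_machine` is an assumption here) depend on your app and database backend:

from django.db.models import Count, Q

qs = gaming_machine.objects.filter(
    Q(points=100) | Q(points=192), created__startswith=today
).values_list(
    'machine_no', 'points'
).annotate(
    Count('machine_no')
)

# prints something along these lines (backend-dependent):
#   SELECT machine_no, points, COUNT(machine_no)
#   FROM appname_gaming_machine
#   WHERE (points = 100 OR points = 192) AND created LIKE '2018-06-13%'
#   GROUP BY machine_no, points
print(qs.query)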

So a dictionary comprehension over the queryset should do the job:

{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        Q(points=100) | Q(points=192), created__startswith=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
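With the data from the question, this yields {(330, 192): 2, (330, 100): 4, ...}, so an individual count can then be looked up in constant time, e.g. result[(330, 192)].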

Speeding up queries

Since the problem is probably located in the database, you may want to consider some options for speeding the query up.

Building indexes

Typically, the best way to boost the performance of queries is to construct an index on columns that you filter on frequently and that have a large number of distinct values.

In that case the database constructs a data structure that stores, for every value of that column, a list of the rows that match that value. As a result, instead of reading through all the rows and selecting the relevant ones, the database can consult this data structure directly and determine, usually in reasonable time, which rows have that value.

Note that this typically only helps if the column contains a large number of distinct values. If, for example, the column only contains two values (say 1% of the rows hold 0 and 99% hold 1) and we filter on the very common value, the index will not produce much speedup, since the set we need to process has approximately the same size as the whole table.
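If you are unsure whether a column is selective enough to benefit from an index, a quick sanity check (my suggestion, not part of the original answer) is to count its distinct values from the Django shell:

# a high count relative to the table size suggests an index will help
gaming_machine.objects.values_list('points', flat=True).distinct().count()
gaming_machine.objects.values_list('created', flat=True).distinct().count()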

So, depending on how distinct the values are, we can add indexes to the points and created fields:

class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    points = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
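Note that `db_index=True` only takes effect after generating and running a migration (`python manage.py makemigrations` followed by `python manage.py migrate`), as the comments below also point out.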

Improve the query

Secondly, we can also aim to improve the query itself, although this may be more database-dependent: given two queries q1 and q2, it is possible that q1 runs faster than q2 on a MySQL database, while q2 runs faster than q1 on a PostgreSQL database. So this is quite tricky: some things typically work well in general, but it is hard to give guarantees.

For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which may perform well depending on how the database stores timestamps, but it can result in a computationally expensive query if the database first needs to convert the datetime. In any case, created__date is more declarative, since it makes clear that you want the date part of created to equal today. So a more efficient query is probably:

{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
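If you want to check where the time is actually spent, the comments below use Django's `connection.queries`; a minimal sketch of such a measurement (it requires `DEBUG = True` in the settings, and reuses `today` from the question):

from django.db import connection, reset_queries
from django.db.models import Count

reset_queries()

result = {
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}

# each entry holds the raw SQL and the time the database spent on it,
# separating the query cost from the dictionary comprehension cost
for q in connection.queries:
    print(q['time'], q['sql'])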
Willem Van Onsem
  • Thanks for taking the time to answer this. I used Django's `connection.queries` feature to check how long the query takes, and it still hasn't improved. You are right that the query is slow: it takes on average 1.4 to 1.7 seconds just to execute the query. I tried changing the date lookup as you suggested, and using `in` rather than `or`, but it still didn't improve. – Ash Singh Jun 19 '18 at 10:56
  • The dictionary comprehension part does its job pretty fast. So formatting the returned data the way I want is not a problem anymore; the query itself is the problem. – Ash Singh Jun 19 '18 at 10:57
  • @jencko: then perhaps you should create a table that stores the results of the query, and periodically update it (if that is an option)? – Willem Van Onsem Jun 19 '18 at 10:58 (a sketch of this idea follows after these comments)
  • Yeah, I think so. I can build a solution with a cron job that writes the results to a different model, but it still makes me wonder why this query is so slow. I wish I could just know the reason. – Ash Singh Jun 19 '18 at 11:06
  • @jencko: how many rows does the `gaming_machine` table contain? – Willem Van Onsem Jun 19 '18 at 11:07
  • Furthermore: did you make migrations, and run those migrations? Since `db_index=True` needs a migration. – Willem Van Onsem Jun 19 '18 at 11:08
  • around 11 million rows of data. – Ash Singh Jun 19 '18 at 11:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173392/discussion-between-jencko-and-willem-van-onsem). – Ash Singh Jun 19 '18 at 11:09
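As suggested in the comments above, if the query itself remains slow over ~11 million rows, one option is to materialize the aggregate in a separate table and refresh it periodically, for example from a cron job. A minimal sketch of that idea; the `ScoreCount` model and `refresh_score_counts` helper are hypothetical names, not part of the original answer:

from django.db import models
from django.db.models import Count

class ScoreCount(models.Model):
    # pre-aggregated (machine_no, points) counts, refreshed periodically
    machine_no = models.IntegerField()
    points = models.IntegerField()
    count = models.IntegerField()

def refresh_score_counts():
    # recompute the aggregates; run this from a cron job or similar
    rows = gaming_machine.objects.filter(
        points__in=[100, 192]
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
    ScoreCount.objects.all().delete()
    ScoreCount.objects.bulk_create(
        ScoreCount(machine_no=m, points=p, count=c) for m, p, c in rows
    )

The view then reads from `ScoreCount` instead of aggregating over the full table on every request.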