5

I have a Django 1.8 application, and I am using an MS SQL database with pyodbc as the DB backend (via the "django-pyodbc-azure" module).

I have the following models:

class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)

I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:

from datetime import timedelta
from django.db.models import F

sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())

badDevices is a pre-filled list of device ids with around 60 items.

badDevices = ['id-1', 'id-2', ...]

This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.

I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
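(A rough sketch of how the generated SQL can be inspected: `queryset.query` and, with `DEBUG=True`, `connection.queries` are the standard Django hooks; the variable names below are only illustrative.)

qs = (Session.objects.filter(branch=branch)
      .exclude(device__in=badDevices)
      .filter(end__gte=F('start') + timedelta(minutes=30)))
print(qs.query)  # the SQL Django will send, with parameters inlined approximately

from django.db import connection
qs.count()                     # actually run the query
print(connection.queries[-1])  # last executed statement and its timing (DEBUG=True only)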

This is the generated SQL:

SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end] 
FROM [session] 
WHERE ([session].[branch_id] = my-branch-id AND 
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND 
DATEPART(dw, [session].[start]) = 1 
AND [session].[end] IS NOT NULL AND 
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))

So the exclude doesn't seem to affect query performance at the database level, but in Django the query runs about 6 times slower when I add the exclude part. What could be causing this?

Ozgur Akcali

2 Answers

6

The general issue seems to be that Django is doing some extra work to prepare the exclude clause. Once the SQL has been generated and sent to the database, there isn't anything interesting happening on the Django side that could cause such a significant delay.

In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet, then Django might be executing the badDevices query just to prepare the actual query's SQL. Something similar might be happening because Device has a non-default primary key.
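As a minimal sketch of that scenario (the names below are assumptions, not the actual code from the question): if badDevices were a QuerySet, evaluating it into a plain list up front would avoid the extra query during compilation:

# Hypothetical: a badDevices QuerySet would otherwise be evaluated while building the outer query
bad_device_ids = list(badDevices.values_list('uid', flat=True))  # force one query, keep plain ids

sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device_id__in=bad_device_ids)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())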

The other thing that might delay the SQL preparation is, of course, django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.

This is all wild speculation though, so if you're still having this issue, post the Device and Branch models as well, the exact content of badDevices, and the SQL generated from the queries. Then maybe some scenarios can at least be eliminated.

EDIT: I think it must be the Device.uid field. Possibly Django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:

  • Replace device__in with device_id__in, device__pk__in and device__uid__in, and check each one again (see the sketch after this list). Maybe a more explicit query will be easier for Django to translate into SQL. You can even try replacing branch with branch_id, just in case.

  • If the above doesn't work, try replacing the exclude expression with a raw SQL WHERE clause:

    # add quotes (because of the hyphens) & join
    badDevicesIdString = ", ".join(["'%s'" % device_id for device_id in badDevices])

    # Replaces .exclude()
    ... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
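
For the first suggestion, the more explicit lookups would look roughly like this (a sketch only; all three are standard Django field lookups, and the variable names are illustrative):

    # Three more explicit ways to express the same exclusion; each returns a new queryset
    base = Session.objects.filter(branch_id=branch.id)      # branch_id instead of branch, just in case
    by_column = base.exclude(device_id__in=badDevices)      # raw FK column on the session table
    by_pk = base.exclude(device__pk__in=badDevices)         # follow the FK to Device's primary key
    by_uid = base.exclude(device__uid__in=badDevices)       # name the uid field explicitly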
    

If neither works, then most likely the problem is with the whole query and not just the exclude. There are some more options in that case, but try the above first and I will update my answer later if necessary.

gbs
  • badDevices is a Python list, not a QuerySet, so most probably that's not affecting the query performance. Device has a non-default primary key; how would that affect the query performance? I've edited the question to add the other models and the generated SQL. – Ozgur Akcali Oct 22 '15 at 09:56
  • 1
    replacing device__in clause with your suggestions made no difference, but your second suggestion solved the problem, performance of the query is as it should be now, thanks. Any ideas why this worked? – Ozgur Akcali Oct 22 '15 at 14:40
  • It's hard to say at this point without speculating, but the large delay can only be explained by a round-trip to the DB during the query compilation. I suspect that when the `__in` clause is being compiled, django sees a bunch of non-numeric ids, gets confused, SELECTs those `Device` records and then extracts the primary key again (or throws away the result). Issues like this are not very uncommon, especially with third-party backends, and they can't be caught by the test suite unfortunately. I'll try to look a bit further into it to see if I can find the exact cause. – gbs Oct 22 '15 at 15:14
3

Just want to share a similar problem that I had with MySQL and exclude clause performance, and how it was fixed.

When running the exclude clause, the list for the "in" lookup was actually a QuerySet that I had obtained with the values_list method. Checking the exclude query executed by MySQL, the "in" contents were not values but actually another query (a subquery). This behavior was hurting performance on certain large queries.

To fix that, instead of passing the QuerySet, I flattened it into a Python list of values. By doing that, each value is passed as a literal inside the in lookup, and performance improved considerably.
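
A minimal sketch of that change, using field names along the lines of this question (the filter itself is made up for illustration):

# Before: passing the QuerySet directly makes the IN (...) a subquery on the database side
bad_ids_qs = Device.objects.filter(type='kiosk').values_list('uid', flat=True)  # hypothetical filter
slow = Session.objects.exclude(device_id__in=bad_ids_qs)

# After: evaluate once into a plain list, so the IN (...) contains literal values
bad_ids = list(bad_ids_qs)
fast = Session.objects.exclude(device_id__in=bad_ids)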

kikocastroneto