
I have a database with almost 2 million records; many of them are duplicates (different internal id but the same 'idd').

Basically what I need is to query the database for the first x records with only one row per product (distinct idd), ordered by '-sold_count'.

So this is the query I'm trying:

Product.objects.distinct('idd').order_by('idd', '-sold_count')[:2000]

Problem 1: That takes roughly 30 seconds, and I fear it will take longer with more data. How can I make it much faster?

Problem 2: Also, it doesn't order by sold_count descending; if anything it seems to be doing the opposite. How can I order by -sold_count? I tried removing the '-' to experiment, but the results seem to be the same.
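For context on Problem 2: PostgreSQL's DISTINCT ON requires the ORDER BY to start with the DISTINCT ON column, so '-sold_count' only decides which row survives within each idd group; the output itself stays ordered by idd. A plain-Python sketch of those semantics (sample rows are made up for illustration):

```python
# Simulate PostgreSQL's DISTINCT ON (idd) ... ORDER BY idd, sold_count DESC.
rows = [
    {"idd": "b", "sold_count": 10},
    {"idd": "a", "sold_count": 5},
    {"idd": "a", "sold_count": 7},
    {"idd": "b", "sold_count": 3},
]

# ORDER BY idd ASC, sold_count DESC
ordered = sorted(rows, key=lambda r: (r["idd"], -r["sold_count"]))

# DISTINCT ON (idd): keep only the first row of each idd group
seen, distinct_on = set(), []
for r in ordered:
    if r["idd"] not in seen:
        seen.add(r["idd"])
        distinct_on.append(r)

# The result is ordered by idd, NOT by sold_count ...
print(distinct_on)   # [{'idd': 'a', 'sold_count': 7}, {'idd': 'b', 'sold_count': 10}]

# ... so a second, outer sort is still needed to rank by sold_count:
top = sorted(distinct_on, key=lambda r: -r["sold_count"])
print(top)           # [{'idd': 'b', 'sold_count': 10}, {'idd': 'a', 'sold_count': 7}]
```

This is why the deduplicated rows have to be re-sorted (e.g. by wrapping the DISTINCT ON query in an outer query) before taking the top 2000 by sold_count.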

Other information:

idd is a CharField

sold_count is an IntegerField

Hope this is clear enough, if not you can ask any question you may have.

Models:

class Product(models.Model):
    price = models.FloatField(default=False, null=True)
    sold_count = models.IntegerField(default=False, null=True)
    revenue = models.FloatField(default=False, null=True)
    idd = models.CharField(max_length=300, default=False, null=True)
    remaining = models.IntegerField(default=False, null=True)
    category = models.CharField(max_length=300, default=False, null=True)
    brand = models.CharField(max_length=300, default=False, null=True)
    seller = models.CharField(max_length=300, default=False, null=True)
    url = models.CharField(max_length=300, default=False, null=True)
    name = models.CharField(max_length=300, default=False, null=True)
    shipment = models.CharField(max_length=300, default=False, null=True)
    view_count = models.IntegerField(default=False, null=True)
    updated = models.DateTimeField(null=True)

Raw Query sent:

SELECT DISTINCT ON ("server_gitti"."idd")
       "server_gitti"."id", "server_gitti"."price", "server_gitti"."sold_count",
       "server_gitti"."revenue", "server_gitti"."idd", "server_gitti"."remaining",
       "server_gitti"."category", "server_gitti"."brand", "server_gitti"."seller",
       "server_gitti"."url", "server_gitti"."name", "server_gitti"."shipment",
       "server_gitti"."view_count", "server_gitti"."updated"
FROM "server_gitti"
ORDER BY "server_gitti"."idd" ASC, "server_gitti"."sold_count" ASC
LIMIT 2000
madprops

1 Answer


Well, one of the first things I'd do, if I know I'm going to be searching on TextField or CharField columns, is set an index:

class Product(models.Model):
    price = models.FloatField(default=False, null=True)
    sold_count = models.IntegerField(default=False, null=True)
    revenue = models.FloatField(default=False, null=True)
    idd = models.CharField(max_length=300, default=False, null=True, db_index=True)
    remaining = models.IntegerField(default=False, null=True)
    category = models.CharField(max_length=300, default=False, null=True)
    brand = models.CharField(max_length=300, default=False, null=True)
    seller = models.CharField(max_length=300, default=False, null=True)
    url = models.CharField(max_length=300, default=False, null=True)
    name = models.CharField(max_length=300, default=False, null=True)
    shipment = models.CharField(max_length=300, default=False, null=True)
    view_count = models.IntegerField(default=False, null=True)
    updated = models.DateTimeField(null=True)

Django Rocks uses a 2,000-row example, entertainingly.

You may also want to have a look at the SQL that Django is actually running - if you have extra methods on that class (or related classes), you might be triggering more SQL queries than you realise. On the face of it, though, your query is pretty straightforward, and 30 seconds seems slow. I run joins across a couple of million records that take that long on our crummy legacy database, and I had a similar issue with get_or_create queries on a database of tweets between those two sizes, which was solved by indexing.

Edit: I should have added the standard Django docs on optimisation, which have a couple of things to look out for as well.

Edit 2: It looks like SELECT DISTINCT is generally fairly slow, and this is a known problem on Postgres. Debugging through connection.queries is likely to yield some interesting results, though.
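As an illustration (table and column names are taken from the raw query in the question; the index name is hypothetical), the usual next steps on Postgres are to look at the query plan and to try a composite index whose column order and direction match the DISTINCT ON ordering, which can let Postgres walk the index instead of sorting all ~2M rows:

```
-- See what plan Postgres actually picks for the DISTINCT ON query:
EXPLAIN ANALYZE
SELECT DISTINCT ON (idd) *
FROM server_gitti
ORDER BY idd ASC, sold_count DESC
LIMIT 2000;

-- A composite index matching the ORDER BY (same columns, same directions):
CREATE INDEX server_gitti_idd_sold_idx
    ON server_gitti (idd ASC, sold_count DESC);
```

Whether the planner uses the index depends on the data distribution, so re-run EXPLAIN ANALYZE after creating it.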

Withnail
  • Thanks, I added db_index=True to idd and sold_count and migrated the changes. It took a while to migrate the changes. But it seems to not have affected the speed at all :/ – madprops Dec 13 '16 at 12:12
  • Ugh. :( I was doing more background reading as I'm quite interested in this question - [this isn't encouraging, though.](https://www.postgresql.org/message-id/1203934747.4252.10.camel@ebony.site). Updated with some further background - have you run connections.queries, yet? – Withnail Dec 13 '16 at 12:16
  • Hmm what exactly should I do with connection.queries? I just ran it and it showed a sql query. Tried to do the query after that, seemed to take the same time. – madprops Dec 13 '16 at 12:32
  • on postgresql after adding indexes you sometimes need to do VACUUM ANALYZE on that table – e4c5 Dec 13 '16 at 12:44
  • connection queries will show you whether it's running a fairly simple `distinct` command, per the other question I linked, or a more complicated/unnecessary one. Could be worth showing the raw SQL of what's being run in the question to help us/others run down the problem. – Withnail Dec 13 '16 at 12:48
  • Added the query to the question, which I got from print(queryset.query) – madprops Dec 13 '16 at 13:02
  • Ran VACUUM ANALYZE. Didn't seem to improve it – madprops Dec 13 '16 at 13:04