
I am working on a new project and had to build an outline of a few pages really quickly.

I imported a catalogue of 280k products that I want to search through. I opted for Whoosh and Haystack to provide search, as I had used them on a previous project. I added definitions for the indexing and kicked off that process. However, it seems that Django is really, really slow to iterate over the QuerySet. Initially, I thought the indexing was taking more than 24 hours, which seemed ridiculous, so I tested a few other things. I can now confirm that it would take many hours just to iterate over the QuerySet.
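
The index definition is essentially the stock Haystack pattern. A rough sketch from memory (the field choices here are illustrative, not my exact code):

from haystack import indexes

from products.models import SupplierSkus

class SupplierSkusIndex(indexes.SearchIndex, indexes.Indexable):
    # The single document field Haystack requires; rendered from a template.
    text = indexes.CharField(document=True, use_template=True)
    sku = indexes.CharField(model_attr="sku")

    def get_model(self):
        return SupplierSkus

    def index_queryset(self, using=None):
        # This hands the full QuerySet to the indexing backend.
        return self.get_model().objects.all()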

Maybe there's something I'm not used to in Django 2.2? I previously used 1.11 but thought I'd use a newer version now.

The model I'm trying to iterate over:

from django.db import models

# Status and Suppliers are related models defined elsewhere in products.models.
class SupplierSkus(models.Model):
    sku = models.CharField(max_length=20)
    link = models.CharField(max_length=4096)
    price = models.FloatField()
    last_updated = models.DateTimeField("Date Updated", null=True, auto_now=True)
    status = models.ForeignKey(Status, on_delete=models.PROTECT, default=1)
    category = models.CharField(max_length=1024)
    family = models.CharField(max_length=20)
    family_desc = models.TextField(null=True)
    family_name = models.CharField(max_length=250)
    product_name = models.CharField(max_length=250)
    was_price = models.FloatField(null=True)
    vat_rate = models.FloatField(null=True)
    lead_from = models.IntegerField(null=True)
    lead_to = models.IntegerField(null=True)
    deliv_cost = models.FloatField(null=True)
    prod_desc = models.TextField(null=True)
    attributes = models.TextField(null=True)
    brand = models.TextField(null=True)
    mpn = models.CharField(max_length=50, null=True)
    ean = models.CharField(max_length=15, null=True)
    supplier = models.ForeignKey(Suppliers, on_delete=models.PROTECT)

and, as I mentioned, there are roughly 280k rows in that table.

When I do something as simple as:

from products.models import SupplierSkus
sku_list = SupplierSkus.objects.all()
len(sku_list)

The process quickly sucks up most of the CPU power and never finishes. Likewise, I cannot iterate over the QuerySet:

for i in sku_list:
    print(i.sku)

This will also take hours without printing a single line. However, I can iterate over it using:

for i in sku_list.iterator():
    print(i.sku)

That doesn't help me very much, as I still need to do the indexing via Haystack and I believe that the issues are related.
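
For now I can work around it with something like the following rough sketch (the in_batches helper is just mine; the chunk_size argument to .iterator() exists since Django 2.0):

from products.models import SupplierSkus

def in_batches(queryset, batch_size=2000):
    """Yield lists of objects so each batch can be handed to the search
    backend without materialising the whole QuerySet in memory."""
    batch = []
    # .iterator() streams rows instead of caching them on the QuerySet;
    # chunk_size controls how many rows are fetched per round trip.
    for obj in queryset.iterator(chunk_size=batch_size):
        batch.append(obj)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in in_batches(SupplierSkus.objects.all()):
    print(len(batch))  # hand each batch to the indexing code instead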

This wasn't the case with some earlier projects I've worked on. Even a much more sizeable table (3-5m rows) would be iterated over quite quickly. A query for the list length would take a moment, but return the result in seconds rather than hours.
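
For comparison, a database-side count avoids fetching any rows at all, though even the full fetch should take seconds, not hours:

from products.models import SupplierSkus

# len() evaluates the QuerySet: every row is fetched and cached in memory.
len(SupplierSkus.objects.all())

# .count() runs SELECT COUNT(*) on the database instead; no rows come back.
SupplierSkus.objects.count()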

So, I wonder, what's going on? Is this something someone else has come across?

dengar81
  • Python is not that fast at iterating 100k+ times, but usually 100k+ iterations means processing time in seconds, not hours. Printing *all* 280k+ skus is, however, very unlikely to be what you want. Usually you should be able to filter? – Willem Van Onsem Jul 15 '19 at 13:02
  • So, I'm using Python3 - forgot to mention. I don't actually mind "seconds", I mind "days". Something isn't right here. I've just tried the same again in Django 1.9.2 and the results are massively different. With a much longer QuerySet, Python starts iterating almost immediately. But in 2.2 or 2.0 there's no iterating at all. – dengar81 Jul 15 '19 at 13:07
  • did you JOIN in your queryset, by filtering on a related model? What if you add `.distinct()` at the end? – Willem Van Onsem Jul 15 '19 at 13:10
  • No @WillemVanOnsem. I definitely didn't filter. It's as you see it in my example, no joining, no related data. I also don't think iterating over 280k lines is all that much to work over. I frequently iterate over larger QuerySets. I also am quite a fan of doing something like: list(SupplierSkus.objects.all()) to turn the QS into a List right away. This seems impossible now. My 16 GB RAM, 2.9Ghz i5 is still occupied getting len(sku_list) for multiple hours, while my 4GB RAM, 1.3 Ghz i5 has counted over a much larger set in less than 5 minutes. I'm not sure if it's Django 1.9 vs 2.2 related? – dengar81 Jul 15 '19 at 13:17
  • You might want to use something similar to a Paginator for this. I believe the one that comes packaged with Django will most likely be good enough for your needs, but in case that doesn't work, it should be easy enough to create one yourself by just getting the total count and dividing that up into chunks of a few thousand; there's a sketch of this approach after these comments. Let me know if you need some code. [EDIT: Just found something that would help!](https://stackoverflow.com/a/20285671/7970018) – Wiggy A. Jul 15 '19 at 13:25
  • Thanks @WiggyA. I appreciate that this may be a solution to the issue at hand, but I'm nowhere near understanding the issue in the first place. How come that QuerySets became so horribly inefficient? It has been hours to evaluate the simple length of a QuerySet. At the same time, a similar operation takes seconds in an earlier Django version. That doesn't seem right, correct? – dengar81 Jul 15 '19 at 13:41
  • You are asking the DB to return 280K records and there are two long CharFields that might be 5K. Django does lazy evaluation of objects from a queryset, but I'm not aware that this extends to breaking down an SQL query into slices of SQL query without you programming a sequence of queries. – nigel222 Jul 16 '19 at 08:46
  • You might try fetching less data with a queryset that returns `.values("sku",...)` rather than the entire table. Does that help? – nigel222 Jul 16 '19 at 08:51
  • @nigel222 thank you for your comments. The problem is with the search index - I need to get all of the data indexed and therefore need to pass the whole collection to Haystack. I am thinking about how I break this up into smaller chunks. However, and nobody has really addressed this: there's a big difference in the performance of Django 1.9 and 2. The difference means that many programmes I wrote will work fine on 1.9 but will not work on 2. And that's purely because Django can no longer iterate over large QuerySets - something that I did all the time. – dengar81 Jul 16 '19 at 21:38
  • Oh, and I found out that this isn't a problem when using straight-up SQL through django.db.connection! Doing "select * ..." on the table, the programme iterates over it immediately. – dengar81 Jul 16 '19 at 21:40
  • I'm not an expert here! Slicing the queryset is the obvious way to work in chunks https://docs.djangoproject.com/en/2.2/topics/db/queries/#limiting-querysets or it might be better to use filtering by some field whose distribution is known. If this is a regression from 1.9 to 2 you might want to raise a bug report about it with the Django maintainers. At least they'd explain the reason for the slow-down if they reject fixing it. https://docs.djangoproject.com/en/dev/internals/contributing/bugs-and-features/ – nigel222 Jul 17 '19 at 08:10
  • Thank you for your comment @nigel222. I have now resolved the issue I described here, albeit that didn't solve my Whoosh indexing issue; I think I will migrate to Solr as a backend. The iterating issue itself is solved. – dengar81 Jul 23 '19 at 22:13
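
For reference, the chunking approach suggested in the comments above might look like the following minimal sketch using Django's Paginator (the explicit ordering and the page size of 2,000 are assumptions):

from django.core.paginator import Paginator

from products.models import SupplierSkus

# An explicit ordering keeps the LIMIT/OFFSET page boundaries stable.
queryset = SupplierSkus.objects.order_by("pk")
paginator = Paginator(queryset, 2000)

for page_number in paginator.page_range:
    # Each .page() call issues its own sliced query against the database.
    for sku in paginator.page(page_number).object_list:
        print(sku.sku)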

1 Answer


Okay, I've found that the problem was the Python MySQL driver. Without the .iterator() method, a for loop would get stuck on the last element of the QuerySet. I have posted a more detailed answer on an expanded question here.

I was not using the Django-recommended mysqlclient; I was using the one created by Oracle/MySQL. There seems to be a bug that causes an iterator to get "stuck" on the last element of the QuerySet in a for loop and be trapped in an endless loop under certain circumstances.
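
For anyone wanting to check which driver they are on: the ENGINE entry in settings.py gives it away. A minimal sketch, assuming the standard backend path for each driver (the database name is a placeholder):

# settings.py

# Django's recommended driver (pip install mysqlclient) is used through
# the built-in backend:
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "products",  # placeholder
    }
}

# The Oracle/MySQL connector (mysql-connector-python) ships its own
# backend instead, typically "mysql.connector.django"; if that is what
# ENGINE points at, you are on the driver that misbehaved for me.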

Come to think of it, it may well be that this is a design feature of the MySQL driver. I remember having a similar issue with a Java version of this driver before. Maybe I should just ditch MySQL and move to PostgreSQL?

I will try to raise a bug with Oracle anyway.

dengar81