I am working on a new project and had to build an outline of a few pages very quickly.
I imported a catalogue of 280k products that I want to search through. I opted for Whoosh and Haystack to provide search, as I used them on a previous project. I added the index definitions and kicked off the indexing process. However, Django seems to be extremely slow at iterating over the QuerySet. Initially, I thought the indexing was taking more than 24 hours, which seemed ridiculous, so I tested a few other things. I can now confirm that it would take many hours just to iterate over the QuerySet.
Maybe there's something I'm not used to in Django 2.2? I previously used 1.11 but thought I would use a newer version this time.
The model I'm trying to iterate over:
class SupplierSkus(models.Model):
    sku = models.CharField(max_length=20)
    link = models.CharField(max_length=4096)
    price = models.FloatField()
    last_updated = models.DateTimeField("Date Updated", null=True, auto_now=True)
    status = models.ForeignKey(Status, on_delete=models.PROTECT, default=1)
    category = models.CharField(max_length=1024)
    family = models.CharField(max_length=20)
    family_desc = models.TextField(null=True)
    family_name = models.CharField(max_length=250)
    product_name = models.CharField(max_length=250)
    was_price = models.FloatField(null=True)
    vat_rate = models.FloatField(null=True)
    lead_from = models.IntegerField(null=True)
    lead_to = models.IntegerField(null=True)
    deliv_cost = models.FloatField(null=True)
    prod_desc = models.TextField(null=True)
    attributes = models.TextField(null=True)
    brand = models.TextField(null=True)
    mpn = models.CharField(max_length=50, null=True)
    ean = models.CharField(max_length=15, null=True)
    supplier = models.ForeignKey(Suppliers, on_delete=models.PROTECT)
As I mentioned, there are roughly 280k rows in that table.
When I do something as simple as:
from products.models import SupplierSkus
sku_list = SupplierSkus.objects.all()
len(sku_list)
the process quickly consumes most of the CPU and never finishes. Likewise, I cannot iterate over it:
for i in sku_list:
    print(i.sku)
This also takes hours and does not print a single line. However, I can iterate over it using:
for i in sku_list.iterator():
    print(i.sku)
That doesn't help me very much, as I still need to do the indexing via Haystack and I believe that the issues are related.
This wasn't the case on some earlier projects I've worked on. Even much larger tables (3-5m rows) were iterated over quite quickly, and a query for the list length would take a moment but return the result in seconds rather than hours.
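To put actual numbers on "seconds rather than hours", the first few rows of each loop could be timed instead of waiting for the whole table. This is only a minimal sketch: `time_first_rows` is a hypothetical helper of my own, not part of Django or Haystack, and the demo at the bottom uses stand-in data so it runs without a database.

```python
import time
from itertools import islice


def time_first_rows(rows, limit=1000):
    """Consume up to `limit` items from `rows` and return (count, seconds).

    `rows` can be any iterable, for example a QuerySet or the result of
    QuerySet.iterator(). The helper name and signature are hypothetical.
    """
    start = time.perf_counter()
    # islice stops after `limit` items, so only that many are pulled from `rows`.
    count = sum(1 for _ in islice(rows, limit))
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in for the 280k-row table; replace with a real queryset in a Django shell.
    count, elapsed = time_first_rows(iter(range(280_000)), limit=1000)
    print(f"{count} rows in {elapsed:.4f}s")
```

In a Django shell this would be called as `time_first_rows(SupplierSkus.objects.all().iterator())` and compared with `time_first_rows(SupplierSkus.objects.all())`. Note that iterating a plain QuerySet loads the entire result set into its cache before yielding the first row, so the second call may well take as long as the full loop.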
So, I wonder, what's going on? Is this something someone else has come across?