
I have a Celery task that runs periodically and may update a few objects on each run (~10 at most). I have two other steps that need to occur when models are updated by that task, and I had planned to use Django's post_save model signal to accomplish them.

However, I realized that handling the update of a large number of objects (potentially tens or hundreds of thousands) in a post_save handler might not be ideal.

Here's the flow of what I'm trying to accomplish, with some example information:

Celery task, every 30s:
Update Book object based on conditions
                 |
post_save(Book)  |
                 V
Update some field on all Reader objects with a foreign key to the updated Book
                 | (update will have different results per Reader; call a model method?)
post_save(Reader)|
                 V
Update some fields on all PublisherStats objects connected to each Reader (not a direct FK)
                 (fields updated via a method)

I hope this makes sense: basically, a chain of updates needs to happen when the conditions in the first task are met. My initial task works, and I've hooked up the post_save handler for the model, but realizing the scale of what could be updated in steps 2 and 3 gave me pause.

Would using a chain of Celery tasks be the best way to accomplish this? My gut feeling is that handling the huge updates of steps 2 and 3 inside a post_save handler would be very slow. I was also considering having the post_save handler just call some other method, but at that point it's effectively a synchronous call rather than a Celery task.

EDIT: After thinking about this for a bit, I think I'm going to have to move the 3rd step somewhere else. I can see how concurrency issues could arise if updating the fields on the PublisherStats objects took longer than 30 seconds, and then the Reader objects got updated again and kicked off yet another update of the PublisherStats objects.

The question still stands, though, as I have no idea which method is better for step 2. I was thinking I could just fold it into the first Celery task and do something like:

if book.field == new_value:
    readers = Reader.objects.filter(...).reader_update()

where reader_update() is a Reader model method, but it doesn't look like that's possible without a custom Queryset. Is there some way to call a model method on an entire queryset? I know it's possible to do:

readers = Reader.objects.filter(...).update(field=new_value)

and this would be similar, conceptually. Maybe I'm wrong about that, though.

Basically, there seem to be a bunch of ways to go about this and I just don't want to pick a slow one!

dkhaupt
  • Your question is way too broad and there are too many unknowns. It is not possible to know what approach will perform best - it does not just depend on how you structure your celery workflow but also on external factors like your database design, database table interrelationships and so on. The best thing here is to prototype some different approaches and see which ones perform acceptably. – scytale Feb 29 '16 at 10:54
  • Fair enough- I believe I've refined the question here into something a lot more focused in this question: http://stackoverflow.com/questions/35690484/django-updating-many-objects-with-per-object-calculation as the core of this is really how to update a ton of objects that need more than just `field = new_value` – dkhaupt Feb 29 '16 at 11:15

0 Answers