
Newb question about Django app design:

I'm building a reporting engine for my website. I have a large (and growing over time) amount of data, and an algorithm that must be applied to it. The calculations promise to be heavy on resources, and it would be foolish to perform them in response to user requests. So I'm thinking of putting them into a background process that runs continuously and periodically returns results, which can then be fed to Django view routines to produce HTML output on demand.

And my question is: what is the proper design approach for building such a system? Any thoughts?

Gill Bates

3 Answers


Celery is one of your best choices. We are using it successfully. It has a powerful scheduling mechanism - you can either schedule tasks as timed jobs or trigger tasks in the background when a user (for example) requests them.

It also provides ways to query the status of such background tasks and has a number of flow-control features. It allows for very easy distribution of the work - i.e. your Celery background tasks can run on a separate machine (this is very useful, for example, with Heroku's web/worker split, where the web process is limited to a maximum of 30 s per request). It provides various queue backends (it can use the database, RabbitMQ, or a number of other queuing mechanisms). With the simplest setup it can use the same database that your Django site already uses, which makes it easy to set up.

And if you are using automated tests, it also has a feature that helps with testing - it can be set to "eager" mode, where background tasks are not executed in the background, thus giving predictable logic for testing.
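For illustration, a minimal sketch of what such a task might look like - the module, task name, and broker URL here are hypothetical choices, not anything Celery prescribes:

# tasks.py - a minimal, hypothetical Celery task definition
from celery import Celery

app = Celery("reports", broker="amqp://localhost")  # broker URL is an assumption

@app.task
def generate_report(report_id):
    """Runs in a worker process, outside the web request/response cycle."""
    ...  # do the heavy math here and store the result in the database

A Django view would then only enqueue the work with generate_report.delay(report_id), and the "eager" test setting mentioned above makes .delay() run the task synchronously instead.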

More info here: http://docs.celeryproject.org:8000/en/latest/django/

Jarek Potiuk

Do you mean the results are returned into a database, or do you want to create Django views directly from your independently running code?

If you have large amounts of data, I like to use Python's multiprocessing module. You can create a generator that fills a JoinableQueue with the different tasks to do, and a pool of workers consuming those tasks. This way you should be able to maximize the resource utilization of your system.

The multiprocessing module also allows you to distribute tasks over the network (e.g. with multiprocessing.Manager()). With this in mind, you should easily be able to scale up if you need a second machine to process the data in time.
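As a rough illustration of that networked variant, a manager can serve the work queue to consumers on other machines - the address and authkey below are illustrative choices, not fixed values:

# Hypothetical sketch: expose the work queue over the network so that
# workers on another machine can connect to it and consume tasks.
from multiprocessing.managers import BaseManager
from queue import Queue

work_queue = Queue()

class QueueManager(BaseManager):
    pass

QueueManager.register("get_queue", callable=lambda: work_queue)

if __name__ == "__main__":
    manager = QueueManager(address=("", 50000), authkey=b"secret")
    server = manager.get_server()
    server.serve_forever()  # remote workers connect(), then call get_queue()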

Example:

This example shows how to spawn multiple processes. The generator function should query the database for all new entries that need heavy lifting. The consumers take the individual items from the queue and do the actual calculations.

import time

from multiprocessing import JoinableQueue, Process


def generator(queue):
    """Puts items in the queue. For example, query the database for all new,
    unprocessed entries that need some serious math done..."""
    while True:
        queue.put("Item")
        time.sleep(0.1)


def consumer(queue, consumer_id):
    """Consumes items from the queue... Do your calculations here..."""
    while True:
        item = queue.get()
        print("Process %s has done: %s" % (consumer_id, item))
        queue.task_done()


if __name__ == "__main__":
    # Pass the queue to the child processes explicitly so the example also
    # works with the "spawn" start method (the default on Windows and macOS).
    queue = JoinableQueue()

    producer = Process(target=generator, args=(queue,))
    producer.start()

    workers = []
    for x in range(2):
        w = Process(target=consumer, args=(queue, x))
        w.start()
        workers.append(w)

    producer.join()
    for w in workers:
        w.join()
chkorn
  • "You mean the results are returned into a database or do you want to create django-views directly from your independently running code?" Than seamlessly they will be integrated than better. – Gill Bates Dec 08 '12 at 20:13
  • Since you said that the amount of data that needs to be processed is big, I would suggest you save the data into a database and query the database from Django. This way you don't need to create an extra communication channel if you use multiprocessing, and the data is persisted. You can use only parts of Django to make the DB access work from the other code, as discussed in http://stackoverflow.com/questions/937742/use-django-orm-as-standalone and http://stackoverflow.com/questions/302651/use-only-some-parts-of-django for example. – chkorn Dec 08 '12 at 20:41
  • 1
    My confusion is in how that background process can be called within django, takin in attention that it must be unbounded from request - how this process can be started in server startup (or by schedule), and executed continously? – Gill Bates Dec 08 '12 at 20:54
  • Ah! Now I get it. Sorry for the confusion. I suggest you just set a flag in your database that says "work needs to be done". The background task(s) then just query your database for the list of open tasks. I have added an example with a consumer/generator that should demonstrate the principle a little bit. This process would be started by hand or as a service/daemon. Making it run continuously can be done with an infinite loop (e.g. `while True:`); a rough sketch of such a polling loop follows after these comments. Just make sure you have some kind of stop condition so that you can terminate gracefully :) – chkorn Dec 08 '12 at 21:26
  • Can this process be put inside the Django runtime, or is that a bad decision? P.S. Thank you for the example, by the way! – Gill Bates Dec 08 '12 at 21:36
  • So that you can access your methods/functions/etc.? Yes, but you don't have to. Getting these things working is described in the questions linked above. What you can't do is call this code as a view or from a view. My suggestion requires a stand-alone process. Sorry – chkorn Dec 08 '12 at 21:53
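Tying the comments together, a rough sketch of such a stand-alone polling daemon that sets up Django first so the ORM is available - the settings module, app, model, and field names are all hypothetical:

# A hypothetical stand-alone daemon that polls the database for flagged work.
import os
import signal
import time

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")  # hypothetical
django.setup()  # Django 1.7+; the questions linked above cover older versions

from reports.models import ReportEntry  # hypothetical app/model

RUNNING = True

def stop(signum, frame):
    """Stop condition so the loop can terminate gracefully."""
    global RUNNING
    RUNNING = False

signal.signal(signal.SIGTERM, stop)
signal.signal(signal.SIGINT, stop)

while RUNNING:
    # Rows flagged "work needs to be done".
    for entry in ReportEntry.objects.filter(processed=False):
        ...  # heavy calculations go here
        entry.processed = True
        entry.save()
    time.sleep(5)  # poll interval is an arbitrary choice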

Why don't you have a URL or Python script that triggers whatever calculation you need done every time it's run, and then fetch that URL or run that script via a cron job on the server? From your question, it doesn't seem like you need a whole lot more than that.
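One way to write that script, for example, is a custom Django management command that the cron job invokes - the command and file names here are hypothetical, and the schedule is just an illustration:

# reports/management/commands/calculate_reports.py (hypothetical path)
# A crontab entry could then run it, e.g. every 15 minutes:
#     */15 * * * * cd /path/to/project && python manage.py calculate_reports
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run the heavy report calculations once and exit."

    def handle(self, *args, **options):
        ...  # query for unprocessed entries, crunch the numbers, save results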

hkothari