
I'm writing an app that will allow the user to upload data in a file; the app will process this data and email the results to the user. Processing may take some time, so I would like to handle it in a separate Python script rather than wait in the view for it to complete. The script and the view don't need to communicate, as the script will pick up the data from a file written by the view. The view will just put up a message like "Thanks for uploading your data - the results will be emailed to you".

What's the best way to do this in Django? Spawn off a separate process? Put something on a queue?

Some example code would be greatly appreciated. Thanks.

FunLovinCoder

4 Answers


The simplest possible solution is to write a custom management command that searches for all the unprocessed files, processes them, and then emails the user. Management commands run inside the Django framework, so they have access to all models, database connections, and so on, but you can call them from wherever you like, for example from crontab.
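
For illustration, a minimal sketch of such a command, assuming a hypothetical UploadedFile model with a processed flag and a process_file() helper (neither is from the question):

# myapp/management/commands/process_uploads.py
from django.core.mail import send_mail
from django.core.management.base import BaseCommand

from myapp.models import UploadedFile        # hypothetical model
from myapp.processing import process_file    # hypothetical helper

class Command(BaseCommand):
    help = "Process pending uploads and email the results to their owners"

    def handle(self, *args, **options):
        for upload in UploadedFile.objects.filter(processed=False):
            results = process_file(upload.path)
            send_mail("Your results are ready", results,
                      "noreply@example.com", [upload.user.email])
            upload.processed = True
            upload.save()

You would then run it as python manage.py process_uploads, e.g. from a crontab entry.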

If you care about the delay between the file being uploaded and processing starting, you could use a framework like Celery, which is basically a helper library for using a message queue and running workers that listen on the queue. This would give you pretty low latency, but on the other hand, simplicity might be more important to you.
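
With a recent Celery, a rough sketch might look like this (again, process_file() and the addresses are invented):

# myapp/tasks.py
from celery import shared_task
from django.core.mail import send_mail

from myapp.processing import process_file    # hypothetical helper

@shared_task
def process_upload(path, email):
    results = process_file(path)
    send_mail("Your results are ready", results,
              "noreply@example.com", [email])

The view then just queues the job and returns immediately: process_upload.delay(path, request.user.email).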

I would strongly advise against starting threads or spawning processes in your views. The threads would be running inside the Django process and could break your web server (depending on your configuration), and a child process would inherit everything from the Django process, which you probably don't want. It is better to keep this stuff separate.

knutin
  • Yes, I want it to be as simple as possible but hadn't realised the results of spawning a process or thread could be so serious. Thanks for the heads up. – FunLovinCoder Nov 27 '10 at 14:35

I currently have a project with similar requirements (just more complicated^^).

Never spawn a subprocess or thread from your Django view. You have no control over the Django process: it is managed by the web server (e.g. Apache via WSGI) and could be killed, paused, etc. before the end of the task.

What I would do is use an external script running in a separate process. I think you have a few options:

  • A process that is always running and watching the directory where you put your files. It would, for example, check the directory every ten seconds and process any new files
  • The same as above, but run from cron every x minutes; this basically has the same effect
  • Use Celery to create worker processes and add jobs to the queue from your Django application. You will then need to get the results back by one of the means available in Celery

Now you probably need access to the information in your Django models to email the user at the end. Here you have several solutions:

  • Import your Django modules (models, etc.) from the external script (see the bootstrap sketch after this list)
  • Implement the external script as a custom command (as knutin suggested)
  • Communicate the results to the Django application via a POST request, for example. You would then do the email sending, status changes, etc. in a normal Django view.
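
For the first option, the external script has to bootstrap Django before touching any models. A sketch of that boilerplate (Django 1.7+ style; the settings module and model names are made up):

# standalone_processor.py
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

import django
django.setup()

# Models may only be imported after setup() has run.
from myapp.models import UploadedFile    # hypothetical model

for upload in UploadedFile.objects.filter(processed=False):
    ...  # process the file and email the user, as above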

I would go for an external process that either imports the modules or uses a POST request, as this is much more flexible. You could, for example, make use of the multiprocessing module to process several files at the same time (and thus use multi-core machines efficiently).
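
If you go the POST route, the notification from the external script can be a few lines of Python 3 standard library code (the URL and fields here are invented):

# Tell the Django app that a file has been processed.
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"file": "results_123.csv", "status": "done"})
urllib.request.urlopen("http://localhost:8000/processing/done/",
                       data.encode("ascii"))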

A basic workflow would be (a sketch follows the list):

  1. Check the directory for new files
  2. For each file (can be parallelized):
    1. Process
    2. Send email or notify your Django application
  3. Sleep for a while
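
A sketch of that loop, assuming an incoming directory and a stub worker function (everything here except the standard library is invented):

# watcher.py
import os
import time
from multiprocessing import Pool

INCOMING = "/var/uploads/incoming"

def process_and_notify(path):
    # 2.1 process the file (CPU-heavy work runs in a worker process)
    # 2.2 send the email, or POST back to the Django app
    ...

if __name__ == "__main__":
    pool = Pool()    # one worker per CPU core by default
    seen = set()
    while True:
        # 1. check the directory for new files
        files = [os.path.join(INCOMING, f) for f in os.listdir(INCOMING)]
        new = [f for f in files if f not in seen]
        if new:
            # 2. process the new files in parallel
            pool.map(process_and_notify, new)
            seen.update(new)
        time.sleep(10)    # 3. sleep for a while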

My project involves some really CPU-demanding processing. I currently use an external process that hands processing jobs to a pool of worker processes (that's basically what Celery could do for you) and reports progress and results back to the Django application via POST requests. It works really well and is relatively scalable, but I will soon change it to use Celery on a cluster.

Marc Demierre
  • Thanks for the great feedback. I may need to kick off multiple threads for large files so will look at the multiprocessing module. – FunLovinCoder Nov 27 '10 at 15:12
  • If the processing is CPU-bound, you have to use processes (e.g. with the multiprocessing module) and not threads (the threading module). The Python Global Interpreter Lock prevents threads from running truly in parallel, so there is no performance increase (I realized this when doing my project). – Marc Demierre Nov 27 '10 at 15:17
  • Also, Celery could be the way to go, as it does a lot of the work I described automatically. The only thing is that you have to figure out how to get the results back, as you can't wait for the task to finish (an HTTP callback could do it easily, I think). – Marc Demierre Nov 27 '10 at 15:22

You could spawn a thread to do the processing. It wouldn't really have much to do with Django; the view function would need to kick off the worker thread and that's it.

If you really want a separate process, you'll need the subprocess module. But do you really need to redirect standard I/O or allow external process control?
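
If you do go the subprocess route, a minimal fire-and-forget sketch (the script name and path are invented):

# Launch a worker script in a fully separate process and return at once.
import subprocess
import sys

uploaded_path = "/tmp/upload.dat"    # wherever the view saved the file
subprocess.Popen([sys.executable, "process_upload.py", uploaded_path])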

Threading example:

from threading import Thread
from MySlowThing import SlowProcessingFunction # or whatever you call it

# ...

Thread(target=SlowProcessingFunction, args=(), kwargs={}).start()

I haven't done a program where I didn't want to track the threads' progress, so I don't know if this works without storing the Thread object somewhere. If you need to do that, it's pretty simple:

allThreads = []

# ... then, inside the view function:

global allThreads
thread = Thread(target=SlowProcessingFunction, args=(), kwargs={})
thread.start()
allThreads.append(thread)

You can remove threads from the list when thread.is_alive() returns False:

def cull_threads():
    global allThreads
    allThreads = [thread for thread in allThreads if thread.is_alive()]
Mike DeSimone

You could use the multiprocessing module: http://docs.python.org/library/multiprocessing.html

Essentially:

from multiprocessing import Process

from django.http import HttpResponseRedirect
from django.shortcuts import render_to_response


def _pony_express(objs, action, user, foo=None):
    # unleash the beasts: do the heavy processing, then email the user
    ...


def bulk_action(request, t):

    ...  # model, pks, action, foo and next_url are defined in the elided code
    objs = model.objects.filter(pk__in=pks)

    if request.method == 'POST':
        objs.update(is_processing=True)

        # hand the work off to a separate process and return immediately
        p = Process(target=_pony_express, args=(objs, action, request.user),
                    kwargs={'foo': foo})
        p.start()

        return HttpResponseRedirect(next_url)

    context = {'t': t, 'action': action, 'objs': objs, 'model': model}
    return render_to_response(...)
Skylar Saveland