
I'm writing an app that will allow the user to upload data in a file; the app will process this data and email the results to the user. Processing may take some time, so I would like to handle it in a separate Python script rather than wait in the view for it to complete. The script and the view don't need to communicate, as the script will pick up the data from a file written by the view. The view will just put up a message like "Thanks for uploading your data - the results will be emailed to you".

What's the best way to do this in Django? Spawn off a separate process? Put something on a queue?

Some example code would be greatly appreciated. Thanks.

FunLovinCoder

4 Answers


The simplest possible solution is to write a custom management command that searches for all the unprocessed files, processes them, and then emails the user. Management commands run inside the Django framework, so they have access to all models, database connections, and so on, but you can call them from wherever you like, for example from crontab.
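
For illustration, a minimal sketch of such a command, assuming a hypothetical UploadedFile model with a processed flag and a process_file() helper (neither is from the question):

# myapp/management/commands/process_uploads.py
from django.core.mail import send_mail
from django.core.management.base import BaseCommand

from myapp.models import UploadedFile        # hypothetical model
from myapp.processing import process_file    # hypothetical helper

class Command(BaseCommand):
    help = "Process pending uploads and email the results to their owners"

    def handle(self, *args, **options):
        for upload in UploadedFile.objects.filter(processed=False):
            results = process_file(upload.path)
            send_mail("Your results are ready", results,
                      "noreply@example.com", [upload.user.email])
            upload.processed = True
            upload.save()

You would then run it as python manage.py process_uploads, e.g. from a crontab entry.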

If you care about the delay between the file being uploaded and processing starting, you could use a framework like Celery, which is basically a helper library for using a message queue and running workers that listen on the queue. This would give you pretty low latency, but on the other hand, simplicity might be more important to you.
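
With a recent Celery, a rough sketch might look like this (again, process_file() and the addresses are invented):

# myapp/tasks.py
from celery import shared_task
from django.core.mail import send_mail

from myapp.processing import process_file    # hypothetical helper

@shared_task
def process_upload(path, email):
    results = process_file(path)
    send_mail("Your results are ready", results,
              "noreply@example.com", [email])

The view then just queues the job and returns immediately: process_upload.delay(path, request.user.email).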

I would strongly advise against starting threads or spawning processes in your views. The threads would be running inside the Django process and could break your web server (depending on your configuration), and a child process would inherit everything from the Django process, which you probably don't want. It is better to keep this stuff separate.

knutin
  • Yes, I want it to be as simple as possible but hadn't realised the results of spawning a process or thread could be so serious. Thanks for the heads up. – FunLovinCoder Nov 27 '10 at 14:35

I currently have a project with similar requirements (just more complicated^^).

Never spawn a subprocess or thread from your Django view. You have no control over the Django process: it is managed by the web server (e.g. Apache via WSGI) and could be killed, paused, etc. before the end of the task.

What I would do is use an external script running in a separate process. I think you have a few options:

  • A process that is always running and watching the directory where you put your files. It would, for example, check the directory every ten seconds and process any new files
  • The same as above, but run from cron every x minutes; this basically has the same effect
  • Use Celery to create worker processes and add jobs to the queue from your Django application. You will then need to get the results back by one of the means available in Celery

Now you probably need access to the information in your Django models to email the user at the end. Here you have several solutions:

  • Import your Django modules (models, etc.) from the external script (see the bootstrap sketch after this list)
  • Implement the external script as a custom command (as knutin suggested)
  • Communicate the results to the Django application via a POST request, for example. You would then do the email sending, status changes, etc. in a normal Django view.
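
For the first option, the external script has to bootstrap Django before touching any models. A sketch of that boilerplate (Django 1.7+ style; the settings module and model names are made up):

# standalone_processor.py
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

import django
django.setup()

# Models may only be imported after setup() has run.
from myapp.models import UploadedFile    # hypothetical model

for upload in UploadedFile.objects.filter(processed=False):
    ...  # process the file and email the user, as above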

I would go for an external process that either imports the modules or uses a POST request, as this is much more flexible. You could, for example, make use of the multiprocessing module to process several files at the same time (and thus use multi-core machines efficiently).
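
If you go the POST route, the notification from the external script can be a few lines of Python 3 standard library code (the URL and fields here are invented):

# Tell the Django app that a file has been processed.
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"file": "results_123.csv", "status": "done"})
urllib.request.urlopen("http://localhost:8000/processing/done/",
                       data.encode("ascii"))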

A basic workflow would be (a sketch follows the list):

  1. Check the directory for new files
  2. For each file (can be parallelized):
    1. Process
    2. Send email or notify your Django application
  3. Sleep for a while
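
A sketch of that loop, assuming an incoming directory and a stub worker function (everything here except the standard library is invented):

# watcher.py
import os
import time
from multiprocessing import Pool

INCOMING = "/var/uploads/incoming"

def process_and_notify(path):
    # 2.1 process the file (CPU-heavy work runs in a worker process)
    # 2.2 send the email, or POST back to the Django app
    ...

if __name__ == "__main__":
    pool = Pool()    # one worker per CPU core by default
    seen = set()
    while True:
        # 1. check the directory for new files
        files = [os.path.join(INCOMING, f) for f in os.listdir(INCOMING)]
        new = [f for f in files if f not in seen]
        if new:
            # 2. process the new files in parallel
            pool.map(process_and_notify, new)
            seen.update(new)
        time.sleep(10)    # 3. sleep for a while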

My project involves some really CPU-demanding processing. I currently use an external process that hands processing jobs to a pool of worker processes (that's basically what Celery could do for you) and reports progress and results back to the Django application via POST requests. It works really well and is relatively scalable, but I will soon change it to use Celery on a cluster.

Marc Demierre
  • Thanks for the great feedback. I may need to kick off multiple threads for large files so will look at the multiprocessing module. – FunLovinCoder Nov 27 '10 at 15:12
  • If the processing is CPU-bound, you have to use processes (e.g. with the multiprocessing module) and not threads (the threading module). The Python Global Interpreter Lock prevents threads from running truly in parallel, so there is no performance increase (I realized this when doing my project). – Marc Demierre Nov 27 '10 at 15:17
  • Also, Celery could be the way to go, as it does a lot of the work I described automatically. The only thing is that you have to figure out how to get the results back, as you can't wait for the task to finish (an HTTP callback could do it easily, I think). – Marc Demierre Nov 27 '10 at 15:22

You could spawn a thread to do the processing. It wouldn't really have much to do with Django; the view function would need to kick off the worker thread and that's it.

If you really want a separate process, you'll need the subprocess module. But do you really need to redirect standard I/O or allow external process control?
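
If you do go the subprocess route, a minimal fire-and-forget sketch (the script name and path are invented):

# Launch a worker script in a fully separate process and return at once.
import subprocess
import sys

uploaded_path = "/tmp/upload.dat"    # wherever the view saved the file
subprocess.Popen([sys.executable, "process_upload.py", uploaded_path])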

Threading example:

from threading import Thread
from MySlowThing import SlowProcessingFunction # or whatever you call it

# ...

Thread(target=SlowProcessingFunction, args=(), kwargs={}).start()

I haven't done a program where I didn't want to track the threads' progress, so I don't know if this works without storing the Thread object somewhere. If you need to do that, it's pretty simple:

allThreads = []

# ... then, inside the view function:

global allThreads
thread = Thread(target=SlowProcessingFunction, args=(), kwargs={})
thread.start()
allThreads.append(thread)

You can remove threads from the list when thread.is_alive() returns False:

def cull_threads():
    global allThreads
    allThreads = [thread for thread in allThreads if thread.is_alive()]
Mike DeSimone

You could use the multiprocessing module: http://docs.python.org/library/multiprocessing.html

Essentially:

from multiprocessing import Process

from django.http import HttpResponseRedirect
from django.shortcuts import render_to_response


def _pony_express(objs, action, user, foo=None):
    # unleash the beasts: do the heavy processing, then email the user
    ...


def bulk_action(request, t):

    ...  # model, pks, action, foo and next_url are defined in the elided code
    objs = model.objects.filter(pk__in=pks)

    if request.method == 'POST':
        objs.update(is_processing=True)

        # hand the work off to a separate process and return immediately
        p = Process(target=_pony_express, args=(objs, action, request.user),
                    kwargs={'foo': foo})
        p.start()

        return HttpResponseRedirect(next_url)

    context = {'t': t, 'action': action, 'objs': objs, 'model': model}
    return render_to_response(...)
Skylar Saveland