
I'm working on a fairly simple CGI with Python, which I'm about to move into Django. The overall setup is pretty standard server-side (i.e. the computation is done on the server):

  1. User uploads data files and clicks "Run" button
  2. Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. About 5-10 minutes later (in the average use case) the program terminates, having written an output file and some .png figures.
  3. Server displays web page with figures and some summary text

I don't think there are going to be hundreds or thousands of people using this at once; however, the computation involved takes a fair amount of RAM and processor power (each instance forks its most CPU-intensive task using Python's multiprocessing Pool), so even a handful of simultaneous jobs could overload the server.
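
For concreteness, the forking I mean has roughly this shape (run_chunk, the input chunks, and the 4-process count are stand-ins for my actual computation):

from multiprocessing import Pool

def run_chunk(chunk):
    # stand-in for the real CPU-intensive step
    return sum(x * x for x in chunk)

def run_job(chunks):
    pool = Pool(processes=4)  # forks worker processes
    try:
        return pool.map(run_chunk, chunks)
    finally:
        pool.close()
        pool.join()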

I'm wondering whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but its page describes it as an "in-memory" queueing system.

What does "in-memory" mean in this context? I worry about memory as well as CPU time, so I want to ensure that only one job at a time is running (or is even held in RAM, whether it receives CPU time or not).

Also, I was trying to decide whether

  • the result page (served by the CGI) should tell you its position in the queue (until the job runs, at which point it displays the actual results page)

    OR

  • the user should submit their email address to the CGI, which will email them the link to the results page when it is complete (the email step itself is sketched just below).
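
For what it's worth, the sending step for the email option looks easy enough, e.g. with Django's send_mail (the addresses and URL here are placeholders):

from django.core.mail import send_mail

def notify_done(address, results_url):
    # mail the user a link to the finished results page
    send_mail(
        "Your job has finished",
        "Your results are ready at: %s" % results_url,
        "noreply@example.com",  # placeholder sender
        [address],
    )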

What do you think is the appropriate design methodology for a light-traffic CGI for a problem of this sort? Advice is much appreciated.

user
  • If you're going to put it into Django, then you could just put entries in a queue table in the database and have an external process poll the table and do whatever work is required, sequentially. Your UI could update the user on the progress of the item in the queue by reading its status from the database table (roughly as sketched below). Just a thought :) – ed. Oct 02 '11 at 21:14
  • +1 for @ed. You can definitely throw the data into a database table and let the external app work off that, and your client side can just be given the current status of the queue until it's completed. Also, you could look at django's Signals for triggering things to happen on certain events: https://docs.djangoproject.com/en/1.3/topics/signals/ – jdi Oct 02 '11 at 21:46
  • I would use [celery](http://celeryproject.org/). – jterrace Oct 02 '11 at 21:55
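
A minimal sketch of the database-as-queue idea from the comments, using sqlite for illustration (the jobs table, its columns, and run_job are assumptions, not from the thread):

import sqlite3
import time

def run_job(input_path):
    # placeholder for the real 5-10 minute computation
    pass

def worker(db_path="queue.db"):
    conn = sqlite3.connect(db_path)
    while True:
        row = conn.execute(
            "SELECT id, input_path FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            time.sleep(5)  # nothing queued yet; poll again shortly
            continue
        job_id, input_path = row
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?",
                     (job_id,))
        conn.commit()
        run_job(input_path)  # only one job ever runs at a time
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?",
                     (job_id,))
        conn.commit()

The result page could then report a job's queue position with a simple COUNT of pending rows that have a smaller id.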

1 Answer


Definitely use celery. You can run an AMQP server, or I think you can use the database as a queue for the messages. It allows you to run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also do database-backed cron jobs if you use django-celery.

It's as simple as this to run a task in the background:

from celery.task import task  # celery 2.x-style task decorator

@task
def add(x, y):
    # runs on a celery worker, not in the web process
    return x + y
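
Calling it from your web code then just enqueues the work for a worker to pick up (fetching the return value requires a configured result backend):

result = add.delay(4, 4)        # returns an AsyncResult immediately
value = result.get(timeout=30)  # blocks until a worker finishes; value == 8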

In a project of mine it's distributing the work over 4 machines, and it works great.

Matt Williamson
  • Do you know how it will handle 10 users submitting jobs at similar times? Does it automatically queue users until resources are available? That's definitely my main concern. – user Oct 03 '11 at 01:27
  • For django-celery just set CELERY_CONCURRENCY = 1 in settings.py. This will allow only one task to run at a time. – Matt Williamson Oct 03 '11 at 01:38
  • For pure python, try this article on ensuring that only one task runs at a time: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time – Matt Williamson Oct 03 '11 at 01:39
  • In the past I would have hacked this together somehow. If I use the database as a queue, how would I handle the file upload (i.e., is there a way to store the entire file in the database, or do I have to digest it into entries somehow)? I've simply used POST before to dump to a named tmpfile with Python. – user Oct 03 '11 at 15:54
  • I would not store a file in a relational database. What I would do is move the uploaded file to some storage folder under a random name, store the file's path in the task or in a database record you're working with, and access it that way. Delete it when you're done (see the sketch after these comments). – Matt Williamson Oct 03 '11 at 15:56
  • Ah, I see! That makes sense. And then I'd have some other job polling the database (or started via celery) that dequeues the first filename, etc. and produces the actual results? If so, that sounds great! – user Oct 03 '11 at 16:07
  • I think you'd just have the task take care of all the processing, and then a view that shows the results and auto-refreshes. Otherwise you'll have to use [APE](http://www.ape-project.org/) or something to push to the browser. – Matt Williamson Oct 03 '11 at 17:10
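
A minimal sketch of the file-handling pattern from the last few comments, assuming Django plus celery (UPLOAD_DIR, the "datafile" form field, run_computation, and the redirect URL are all hypothetical):

import os
import uuid
from django.http import HttpResponseRedirect
from celery.task import task

UPLOAD_DIR = "/var/myapp/uploads"  # hypothetical storage folder

@task
def process_job(path):
    try:
        run_computation(path)  # hypothetical 5-10 minute job
    finally:
        os.remove(path)  # clean up the upload when done

def upload_view(request):
    upload = request.FILES["datafile"]
    path = os.path.join(UPLOAD_DIR, uuid.uuid4().hex)  # random name
    out = open(path, "wb")
    for chunk in upload.chunks():  # stream to disk, not into the DB
        out.write(chunk)
    out.close()
    process_job.delay(path)  # the worker receives only the path
    return HttpResponseRedirect("/status/")  # hypothetical status page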