
In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or any errors thrown.

Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at once, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects, because I cannot use a global variable (a dictionary that lets me look up each X object by a key): Celery uses separate processes, not just separate threads, so each process would modify its own copy of the global table, causing mayhem.
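To make the failure mode concrete, the kind of global-table approach that breaks down looks roughly like this (the task and all names here are purely illustrative); the web process and each Celery worker process get their own private copy of the dictionary, so a handle stored in one process is invisible to the others:

# tasks.py -- sketch of the global-table approach that does NOT work across processes
from celery.task import task   # old-style Celery 2.x decorator

JOBS = {}   # module-level dict: every process ends up with its own private copy

@task
def some_long_job(n):
    # stand-in for the real work
    return n * 2

def launch(job_key, n):
    result = some_long_job.delay(n)   # the "class X" handle (an AsyncResult)
    JOBS[job_key] = result            # only stored in the current process's copy of JOBS
    return job_key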

Subsequently, I received the great advice to use memcached to share memory across the tasks. I got it working and was able to set and get integer and string values between processes.
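For reference, the part that did work looked roughly like this, assuming the python-memcached client and a memcached server on the default local port (keys and values are just for illustration):

import memcache

# connect to the local memcached daemon (default port 11211)
mc = memcache.Client(['127.0.0.1:11211'])

# simple values round-trip fine between processes
mc.set('job_count', 3)
mc.set('last_job_name', 'run-7')
print mc.get('job_count')      # -> 3
print mc.get('last_job_name')  # -> 'run-7'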

The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for class instances. My best guess is that, under the hood, memcached serializes objects into the shared store; class X (understandably) cannot be usefully serialized because it points at live data (the status of the job), so the serialized copy may be out of date (i.e. it may point to the wrong place) when it is loaded again.

Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.

My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance, Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought might be bad style (and potentially insecure).

How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?

Thanks a lot for your help, I'm quite lost.

  • I still don't completely understand. Is your problem the class or the instance objects? Did you try using Python's pickle module to serialize and deserialize the objects/classes? Is it necessary to serialize the whole object, or would it be enough to save, for example, the pk somewhere, or to split the object's attributes into different parts? – Torsten Engelbrecht Oct 18 '11 at 03:41
  • The trouble is that the data referred to may change, and so when it's serialized back in, it won't be current. Basically, it requires that somehow Celery (or the task manager used) update the database or collection of serialized objects atomically when it changes. – user Oct 18 '11 at 15:07

1 Answer


I think django-celery does this work for you. Have you had a look at the tables created by django-celery? For example, djcelery_taskstate holds all the data for a given task, such as state, worker_id, and so on. For periodic tasks there is a table called djcelery_periodictask.

In a Django view you can access the TaskMeta object:

from djcelery.models import TaskMeta

# look up the stored state of a task by its id
task = TaskMeta.objects.get(task_id=task_id)
print task.status
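A related sketch, using the stock Celery AsyncResult API rather than the django-celery models (task_id here is assumed to be the id string returned when the task was launched): because the id is a plain string, it can be stashed anywhere convenient and a fresh handle rebuilt from it later.

from celery.result import AsyncResult

# rebuild a handle from the stored id; the id is a plain string,
# so it is easy to pass between processes, unlike the live result object
result = AsyncResult(task_id)

if result.ready():
    print result.get()    # re-raises the task's exception if it failed
else:
    print result.status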
Reto Aebersold
  • +1 It felt like I was reinventing something that had been done before. Do you know if there's a way to do something similar with a different backend? According to http://packages.python.org/django-celery/reference/djcelery.models.html I can't do what you suggest with an `amqp` backend. – user Oct 18 '11 at 15:05
  • You have to use the database backend, but it solves the problem. Thanks @Reto. – user Oct 18 '11 at 17:03
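For completeness, a minimal sketch of how the database result backend is typically configured in django-celery of that era; the exact setting names are from memory and may differ between versions:

# settings.py (sketch; setting names may vary by django-celery version)
import djcelery
djcelery.setup_loader()

# store task state/results via the Django ORM instead of amqp
CELERY_RESULT_BACKEND = "database"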