
I have a pretty standard Django + RabbitMQ + Celery setup with 1 Celery task and 5 workers.

The task uploads the same (I'm simplifying a bit) big file (~100 MB) asynchronously to a number of remote PCs.

Everything works fine, but at the expense of a lot of memory, since every task/worker loads that big file into memory separately.

What I would like to do is have some kind of cache accessible to all tasks, i.e. load the file only once. Django caching based on locmem would be perfect, but as the documentation says, "each process will have its own private cache instance", and I need the cache to be accessible to all workers.

I tried playing with Celery signals as described in #2129820, but that's not what I need.

So the question is: is there a way to define something global in Celery (like a dict-based class where I could load the file, or something similar)? Or is there a Django trick I could use in this situation?

Thanks.


3 Answers


Why not simply stream the upload(s) from disk instead of loading the whole file into memory?
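
A minimal sketch of the idea, assuming the transport exposes something like a `send()` method (the `connection` object and the `stream_upload` name are made up for illustration): read the file in fixed-size chunks so a worker never holds more than one chunk in memory.

    CHUNK_SIZE = 1024 * 1024  # 1 MB per read keeps memory use small

    def stream_upload(path, connection):
        # Send the file chunk by chunk; `connection` stands in for whatever
        # transport the task uses (socket, SFTP handle, HTTP body, ...).
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                connection.send(chunk)  # hypothetical transport method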

Luper Rouch

It seems to me that what you need is a memcached-backed cache for Django. That way each Celery task will have access to it.
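
As a rough sketch of what that could look like with a recent Django and Celery (the file path, cache key, and task name are hypothetical, and keep in mind the per-item size limit discussed in the comments below):

    # settings.py: point Django's cache at a shared memcached instance,
    # so every Celery worker process talks to the same cache.
    # (The PyMemcacheCache backend requires the pymemcache package.)
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # tasks.py: the first worker to run loads the file, the rest hit the cache.
    from celery import shared_task
    from django.core.cache import cache

    @shared_task
    def upload_to_pc(host):
        payload = cache.get("big_file")
        if payload is None:
            with open("/path/to/big_file", "rb") as f:  # hypothetical path
                payload = f.read()
            cache.set("big_file", payload, timeout=3600)
        # ... upload `payload` to `host` ...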

Łukasz
  • I thought about it; however, the biggest value one can store in memcached is 1 MB. –  Mar 23 '10 at 14:58
  • Why not partition the file (see the sketch after these comments)? And if every task requires access to every bit of this file, then there's no way of avoiding loading it every time. – Łukasz Mar 23 '10 at 15:00
  • Well, I'm hoping it is possible :). Partitioning would increase the complexity, and I think there should be a simpler way to tackle this. –  Mar 23 '10 at 15:07
  • Shared memory across different processes? If all tasks are running on the same machine (if you're using a single Celery server) you can try using http://pypi.python.org/pypi/posix_ipc – Łukasz Mar 23 '10 at 15:24
  • I'm using a single Celery server, yes. posix_ipc is certainly interesting, but I feel it is too low level to solve this problem. I believe the solution lies somewhere in Django caching, a custom Celery loader, or something alike. –  Mar 23 '10 at 15:51
  • Perhaps use a combination of Amazon S3 (or *some* file store) + Memcached - memcached can simply store a location in S3 for all other tasks to download and work on. – rlotun Mar 23 '10 at 18:10
  • Thanks everyone for your ideas. To keep it simple I'll probably end up uploading files in chunks of a few MB. –  Mar 25 '10 at 08:17
  • Hi, I'm struggling with the same issue. I'd be happy to hear which solution you chose in the end... Best – jhagege Sep 29 '14 at 03:36
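
Regarding the partitioning idea raised in the comments, here is a rough sketch of working around memcached's ~1 MB per-item limit by splitting the payload across several cache keys (the key scheme and helper names are invented for illustration):

    from django.core.cache import cache

    PIECE_SIZE = 900 * 1024  # stay safely under memcached's ~1 MB item limit

    def cache_file(key, data, timeout=3600):
        # Store the payload as numbered pieces plus a piece count.
        pieces = [data[i:i + PIECE_SIZE] for i in range(0, len(data), PIECE_SIZE)]
        cache.set("%s:count" % key, len(pieces), timeout)
        for n, piece in enumerate(pieces):
            cache.set("%s:%d" % (key, n), piece, timeout)

    def get_cached_file(key):
        # Reassemble the payload; return None if any piece was evicted.
        count = cache.get("%s:count" % key)
        if count is None:
            return None
        pieces = [cache.get("%s:%d" % (key, n)) for n in range(count)]
        if any(p is None for p in pieces):
            return None
        return b"".join(pieces)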

Maybe you can use threads instead of processes for this particular task. Since threads all share the same memory, you only need one copy of the data in memory, but you still get parallel execution. (This means not using Celery for this task.)
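
For example, a sketch with the standard library's ThreadPoolExecutor (`send_to_host` is a placeholder for whatever actually performs the upload):

    from concurrent.futures import ThreadPoolExecutor

    def send_to_host(host, payload):
        """Placeholder: open a connection to `host` and write `payload`."""
        raise NotImplementedError

    def upload_everywhere(path, hosts, max_workers=5):
        with open(path, "rb") as f:
            payload = f.read()  # loaded once, shared by every thread
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for host in hosts:
                pool.submit(send_to_host, host, payload)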

Nick Perkins