Does Celery support per-worker local storage?
The reason I ask: I need to use the GPU for lots of small tasks, and allocating and deallocating the GPU thread for each one is dominating the computation time.
I have tried threadLocal = threading.local(), as per "Thread local storage in Python", but it appears that each new call gets a fresh thread, so nothing I store there seems to survive from one call to the next.
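For reference, here is a minimal sketch of the pattern I mean. The Celery task decorator is left out so the snippet runs on its own, and make_gpu_handle, get_handle, small_task and the created list are hypothetical names standing in for my real GPU setup and task, not anything from Celery or a GPU library.

```python
import threading

# Thread-local container: each thread sees its own attributes on this object.
_local = threading.local()

# Records every time the (hypothetical) expensive setup actually runs.
created = []


def make_gpu_handle():
    # Hypothetical stand-in for the expensive GPU allocation.
    handle = object()
    created.append(handle)
    return handle


def get_handle():
    # Reuse the handle cached on the current thread; create it on first use.
    handle = getattr(_local, "handle", None)
    if handle is None:
        handle = make_gpu_handle()
        _local.handle = handle
    return handle


def small_task(x):
    # Stand-in for one small task that needs the GPU handle.
    get_handle()
    return x * 2
```

Calling small_task repeatedly from the same thread runs the setup only once, but calling it from a newly started thread runs the setup again. That is consistent with what I observe if every call really does land on a fresh thread.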