12

Can someone guide me on how gearman does retries when exceptions are thrown or when errors occur?

I use the python gearman client in a Django app and my workers are initiated as a Django command. I read from this blog post that retries from error conditions are not straight forward and that it requires sys.exit from the worker side.

Has this been fixed to retry perhaps with sendFail or sendException? Also does gearman support retries with exponentials algorithm – say if an SMTP failure happens its retries after 2,4,8,16 seconds etc?

Cherian
  • 19,107
  • 12
  • 55
  • 69
  • sys.exit() is a bad idea with Gearman - typically it will retry any such job forever (unless you have job-retries set on daemon startup). Just do a `return stringvar` with any status/results from the job (e.g. key into a DB row or cache with the real info.) – RichVel Jan 17 '13 at 13:17

1 Answers1

24

To my understanding, Gearman employs a very "it's not my business" approach - e.g., it does not intervene with jobs performed, unless workers crash. Any success / failure messages are supposed to be handled by the client, not Gearman server itself.

In foreground jobs, this implies that all sendFail() / sendException() and other send*() are directed to the client and it's up to the client to decide whether to retry the job or not. This makes sense as sometimes you might not need to retry.

In background jobs, all the send*() functions lose their meaning, as there is no client that would be listening to the callbacks. As a result, the messages sent will be just ignored by Gearman. The only condition on which the job will be retried is when the worker crashes (which can by emulated with a exit(XX) command, where XX is a non-zero value). This, of course, is not something you want to do, because workers are usually supposed to be long-running processes, not the ones that have to be restarted after each unsuccessful job.

Personally, I have solved this problem by extending the default GearmanJob class, where I intercept the calls to send*() functions and then implementing the retry mechanism myself. Essentially, I pass all the retry-related data (max number of retries, times already retried) together with a workload and then handle everything myself. It is a bit cumbersome, but I understand why Gearman works this way - it just allows you to handle all the application logic.

Finally, regarding the ability to retry jobs with exponential timeout (or any timeout for that matter). Gearman has a feature to add delayed jobs (look for SUBMIT_JOB_EPOCH in the protocol documentation), yet I am not sure about its status - the PHP extension and, I think, the Python module do not support it and the docs say it can be removed in the future. But I understand it works at the moment - you just need to submit raw socket requests to Gearman to make it happen (and the exponential part should be implemented on your side, too).

However, this blog post argues that SUBMIT_JOB_EPOCH implementation does not scale well. He uses node.js and setTimeout() to make it work, I've seen others use the unix utility at to do the same. In any way - Gearman will not do it for you. It will focus on reliability, but will let you focus on all the logic.

Aurimas
  • 2,518
  • 18
  • 23
  • 5
    I know it's an answer to an old question, but I've seen many people struggling with the same issue and I believe it's worth providing the full picture once and for all. – Aurimas Apr 27 '12 at 09:50