Quartz retry when failure

Question

Let's say I have a trigger configured this way:

<bean id="updateInsBBTrigger"         
    class="org.springframework.scheduling.quartz.CronTriggerBean">
    <property name="jobDetail" ref="updateInsBBJobDetail"/>
    <!--  run every morning at 5 AM  -->
    <property name="cronExpression" value="0 0 5 * * ?"/>
</bean>

The job has to connect to another application, and if there is a problem (such as a connection failure) it should retry the task every 10 minutes, up to five times or until it succeeds. Is there a way to configure the trigger to work like this?

score 15 · Answer 1 · answered Dec 06 '16 at 15:33

I would recommend an implementation like this one to recover the job after a failure:

final JobDataMap jobDataMap = jobCtx.getJobDetail().getJobDataMap();
// the key doesn't exist before the first retry
final int retries = jobDataMap.containsKey(COUNT_MAP_KEY) ? jobDataMap.getIntValue(COUNT_MAP_KEY) : 0;

// give up once MAX_RETRIES is reached
if (retries < MAX_RETRIES) {
  log.warn("Retry job " + jobCtx.getJobDetail());

  // increment the number of retries
  jobDataMap.put(COUNT_MAP_KEY, retries + 1);

  final JobDetail job = jobCtx
      .getJobDetail()
      .getJobBuilder()
       // to track the number of retries
      .withIdentity(jobCtx.getJobDetail().getKey().getName() + " - " + retries, "FailingJobsGroup")
      .usingJobData(jobDataMap)
      .build();

  final OperableTrigger trigger = (OperableTrigger) TriggerBuilder
      .newTrigger()
      .forJob(job)
       // trying to reduce back pressure, you can use another algorithm
      .startAt(new Date(jobCtx.getFireTime().getTime() + (retries*100))) 
      .build();

  try {
    // schedule another job to avoid blocking threads
    jobCtx.getScheduler().scheduleJob(job, trigger);
  } catch (SchedulerException e) {
    // log the cause as well, so the failure can be diagnosed
    log.error("Error creating job", e);
    throw new JobExecutionException(e);
  }
}

Why?

It will not block Quartz worker threads.
It will avoid back pressure. With setRefireImmediately the job is fired again immediately, which could lead to back pressure issues.
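
The comment "you can use another algorithm" in the snippet above leaves the choice of delay open. As one possible choice (an assumption, not part of this answer), here is a minimal sketch of a capped exponential backoff. RetryBackoff, delayMillis and nextStart are hypothetical names; the result of nextStart would be passed to startAt(...) in place of the retries*100 expression.

```java
import java.util.Date;

// Hypothetical helper, not part of Quartz: computes when a retry should start.
class RetryBackoff {

    // Delay before retry number `retries` (0-based): baseMillis doubled once
    // per previous retry, never more than maxMillis.
    static long delayMillis(int retries, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < retries && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Start time for the retry trigger, relative to the time the failed run fired.
    static Date nextStart(Date fireTime, int retries, long baseMillis, long maxMillis) {
        return new Date(fireTime.getTime() + delayMillis(retries, baseMillis, maxMillis));
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        long oneHour = 60 * 60 * 1000L;
        for (int retries = 0; retries < 5; retries++) {
            long minutes = delayMillis(retries, tenMinutes, oneHour) / 60000;
            System.out.println("retry " + retries + " starts after " + minutes + " min");
        }
    }
}
```

With a base of 10 minutes and a cap of one hour, the delays grow to 10, 20, 40, 60 and 60 minutes. For the fixed 10-minute interval from the question, passing the same value as base and cap would keep the delay constant.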

So just to make sure: if a failure occurs, you are adding a new Job (Job2) from within another Job (Job1) and starting Job2? Is it really the Job's responsibility to handle such things, or do you know a (conceptually) better example that solves the exception handling issue? — schoener, Jan 16 '20 at 17:49
Conceptually speaking, I would say that we should be able to set up this kind of configuration when we set up the job (maybe). I agree that the job is not the right place, but the last time I used this, 3 years ago, it was the best way I found to solve the problem. — Flávio Ferreira, Jan 23 '20 at 17:23

dogbane · Accepted Answer · 2011-03-15T09:17:50.100

Source: Automatically Retry Failed Jobs in Quartz

If you want to have a job which keeps trying over and over again until it succeeds, all you have to do is throw a JobExecutionException with a flag to tell the scheduler to fire it again when it fails. The following code shows how:

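import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

class MyJob implements Job {

    public MyJob() {
    }

    public void execute(JobExecutionContext context) throws JobExecutionException {

        try {
            // connect to other application etc
        }
        catch (Exception e) {

            try {
                Thread.sleep(600000); // sleep for 10 mins (this blocks the worker thread)
            } catch (InterruptedException ie) {
                // Thread.sleep throws a checked exception; restore the interrupt flag
                Thread.currentThread().interrupt();
            }

            JobExecutionException e2 = new JobExecutionException(e);
            // fire it again
            e2.setRefireImmediately(true);
            throw e2;
        }
    }
}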

It gets a bit more complicated if you want to retry a certain number of times. You have to use a StatefulJob and hold a retryCounter in its JobDataMap, which you increment if the job fails. If the counter exceeds the maximum number of retries, then you can disable the job if you wish.

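import org.quartz.JobDataMap;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.StatefulJob;

class MyJob implements StatefulJob {

    public MyJob() {
    }

    public void execute(JobExecutionContext context) throws JobExecutionException {
        JobDataMap dataMap = context.getJobDetail().getJobDataMap();
        // the key is absent on the first run, so default to 0
        int count = dataMap.containsKey("count") ? dataMap.getIntValue("count") : 0;

        // allow 5 retries
        if (count >= 5) {
            JobExecutionException e = new JobExecutionException("Retries exceeded");
            // make sure it doesn't run again
            e.setUnscheduleAllTriggers(true);
            throw e;
        }

        try {
            // connect to other application etc

            // reset counter back to 0
            dataMap.putAsString("count", 0);
        }
        catch (Exception e) {
            count++;
            dataMap.putAsString("count", count);
            JobExecutionException e2 = new JobExecutionException(e);

            try {
                Thread.sleep(600000); // sleep for 10 mins (this blocks the worker thread)
            } catch (InterruptedException ie) {
                // Thread.sleep throws a checked exception; restore the interrupt flag
                Thread.currentThread().interrupt();
            }

            // fire it again
            e2.setRefireImmediately(true);
            throw e2;
        }
    }
}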
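
Two details of the counter are easy to get wrong: the key is missing on the first run, and putAsString stores the value as a String rather than an Integer. Below is a minimal sketch of counter handling that tolerates both cases. It works on a plain Map<String, Object> so that it runs without Quartz; as far as I know, JobDataMap in Quartz 2.x implements that interface, but treat this as an assumption. RetryCounter and its methods are hypothetical names, and "retry up to five times" is interpreted here as one original run plus five retries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Quartz: keeps a retry counter in a job data map.
class RetryCounter {
    static final String COUNT_KEY = "count"; // same key as in the answer above
    static final int MAX_RETRIES = 5;

    // Reads the counter; a missing key counts as 0, String values are parsed.
    static int read(Map<String, Object> data) {
        Object raw = data.get(COUNT_KEY);
        if (raw == null) {
            return 0;
        }
        if (raw instanceof Number) {
            return ((Number) raw).intValue();
        }
        return Integer.parseInt(raw.toString().trim());
    }

    // Records one failure and returns true while another retry is still allowed.
    static boolean recordFailure(Map<String, Object> data) {
        int failures = read(data) + 1;
        data.put(COUNT_KEY, String.valueOf(failures)); // stored as String, like putAsString
        return failures <= MAX_RETRIES;
    }

    // Resets the counter after a successful run.
    static void reset(Map<String, Object> data) {
        data.put(COUNT_KEY, "0");
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("failure " + attempt + ", retry allowed: " + recordFailure(data));
        }
    }
}
```

Keeping the decision in one small method makes it straightforward to call from either a job or a listener, whichever ends up owning the retry logic.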

-1, I don't recommend this approach: it will block one of the Quartz worker threads for 10 minutes. The right way would be to use existing Quartz functionality and tell it somehow to rerun the same job after 10 minutes; after all, this is what it is made for. If we are going to run some code and sleep, there is no point in using Quartz in the first place. — Tomasz Nurkiewicz, Mar 05 '12 at 15:14
To supplement: in Quartz 2.0 (for .NET at least), StatefulJob is replaced by `PersistJobDataAfterExecutionAttribute`: http://quartznet.sourceforge.net/apidoc/2.0/html/html/babe3560-218c-38de-031a-7fe1fdd569d2.htm — ossek, Mar 24 '14 at 19:08
The JobExecutionContext (in Quartz 2.2.1 at least, not sure about the other versions) has a getRefireCount() method that could be used instead of the count variable — jrochette, Mar 25 '14 at 15:49

score 7 · Answer 3 · answered Dec 10 '10 at 14:14

For more flexibility and configurability, I would suggest storing two offsets in your DB: the repeatOffset, which tells you how long to wait before the job is retried, and the trialPeriodOffset, which defines the time window in which the job may be rescheduled. You can then retrieve these two parameters like this (I assume you are using Spring):

String repeatOffset = yourDBUtilsDao.getConfigParameter(..);
String trialPeriodOffset = yourDBUtilsDao.getConfigParameter(..);

Then, instead of remembering a counter, the job needs to remember the initialAttempt:

Long initialAttempt = null;
initialAttempt = (Long) existingJobDetail.getJobDataMap().get("firstAttempt");

and perform something like the following check:

long allowedThreshold = initialAttempt + Long.parseLong(trialPeriodOffset);
if (System.currentTimeMillis() > allowedThreshold) {
    //We've tried enough, time to give up
    log.warn("The job is not going to be rescheduled since it has reached its trial period threshold");
    sched.deleteJob(jobName, jobGroup);
    return YourResultEnumHere.HAS_REACHED_THE_RESCHEDULING_LIMIT;
}

It would be a good idea to create an enum for the result of the attempt, which is returned to the core workflow of your application as shown above.
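
As a rough illustration of the enum and the trial period check, here is a minimal sketch that runs without Quartz. RescheduleResult stands in for YourResultEnumHere, and TrialWindow and decide are hypothetical names. Returning ERROR for a missing first attempt or an unparsable offset is an assumption, not something this answer prescribes.

```java
// Hypothetical helper: decides whether a failed job may still be rescheduled.
class TrialWindow {

    // Stand-in for YourResultEnumHere, with the three values used in this answer.
    enum RescheduleResult { RESCHEDULED, HAS_REACHED_THE_RESCHEDULING_LIMIT, ERROR }

    static RescheduleResult decide(Long initialAttemptMillis, String trialPeriodOffset, long nowMillis) {
        if (initialAttemptMillis == null) {
            return RescheduleResult.ERROR; // the first attempt was never recorded
        }
        long window;
        try {
            window = Long.parseLong(trialPeriodOffset);
        } catch (NumberFormatException e) {
            return RescheduleResult.ERROR; // offset from the DB is missing or not a number
        }
        long allowedThreshold = initialAttemptMillis + window;
        return nowMillis > allowedThreshold
                ? RescheduleResult.HAS_REACHED_THE_RESCHEDULING_LIMIT
                : RescheduleResult.RESCHEDULED;
    }

    public static void main(String[] args) {
        long firstAttempt = System.currentTimeMillis() - 40 * 60 * 1000L; // 40 minutes ago
        String fiftyMinutes = String.valueOf(50 * 60 * 1000L);
        System.out.println(decide(firstAttempt, fiftyMinutes, System.currentTimeMillis()));
    }
}
```

Passing the current time in as a parameter keeps the check easy to test; the caller would still perform the actual deleteJob or rescheduleJob call based on the returned value.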

Then construct the rescheduling time:

Date startTime = null;
startTime = new Date(System.currentTimeMillis() + Long.parseLong(repeatOffset));

String triggerName = "Trigger_" + jobName;
String triggerGroup = "Trigger_" + jobGroup;

Trigger retrievedTrigger = sched.getTrigger(triggerName, triggerGroup);
if (!(retrievedTrigger instanceof SimpleTrigger)) {
    log.error("While rescheduling the Quartz Job retrieved was not of SimpleTrigger type as expected");
    return YourResultEnumHere.ERROR;
}

((SimpleTrigger) retrievedTrigger).setStartTime(startTime);
sched.rescheduleJob(triggerName, triggerGroup, retrievedTrigger);
return YourResultEnumHere.RESCHEDULED;

In my case, retrievedTrigger had to be cast to SimpleTriggerImpl. — Ali Tofigh, Jan 06 '21 at 14:08

score 0 · Answer 4 · answered Apr 06 '23 at 15:08

I hope this information will be useful for you (this is a copy of my answer in this thread)

Below is an example of a multi-instance Spring Boot application that launches a cron job.
The Job must be running on only one of the instances.
The configuration of each instance must be the same.
If the job crashes, it should be restarted up to 3 times, each time with a delay of 5 minutes * the number of restart attempts so far.
If the job still crashes after 3 restarts, its trigger should be reset to the default cron schedule.

We will use Quartz in cluster mode:

Deps:

implementation("org.springframework.boot:spring-boot-starter-quartz")

First of all, it is a bad idea to use Thread.sleep(600000), as said in this answer.
Our job:

@Component
@Profile("quartz")
class SomeJob(
    private val someService: SomeService
) : QuartzJobBean() {
    private val log: Logger = LoggerFactory.getLogger(SomeJob::class.java)
    
    override fun executeInternal(jobExecutionContext: JobExecutionContext) {
        try {
            log.info("Doing awesome work...")
            someService.work()
            if ((1..10).random() >= 5) throw RuntimeException("Something went wrong...")
        } catch (e: Exception) {
            throw JobExecutionException(e)
        }
    }
}

Here is the Quartz configuration (more information here):

@Configuration
@Profile("quartz")
class JobConfig {
    //JobDetail for our job
    @Bean
    fun someJobDetail(): JobDetail {
        return JobBuilder
            .newJob(SomeJob::class.java).withIdentity("SomeJob")
            .withDescription("Some job")
            //Request recovery so that the job is re-run at the next launch
            //if the application instance crashes while the job is executing
            .requestRecovery(true)
            .storeDurably().build()
    }

    //Trigger
    @Bean
    fun someJobTrigger(someJobDetail: JobDetail): Trigger {
        return TriggerBuilder.newTrigger().forJob(someJobDetail)
            .withIdentity("SomeJobTrigger")
            .withSchedule(CronScheduleBuilder.cronSchedule("0 0 4 L-1 * ? *"))
            .build()

    }

    //Without this bean, changing the cron of an existing trigger has no effect (the old cron value stays stored in the database)
    @Bean
    fun scheduler(triggers: List<Trigger>, jobDetails: List<JobDetail>, factory: SchedulerFactoryBean): Scheduler {
        factory.setWaitForJobsToCompleteOnShutdown(true)
        val scheduler = factory.scheduler
        factory.setOverwriteExistingJobs(true)
        //https://stackoverflow.com/questions/39673572/spring-quartz-scheduler-race-condition
        factory.setTransactionManager(JdbcTransactionManager())
        rescheduleTriggers(triggers, scheduler)
        scheduler.start()
        return scheduler
    }

    private fun rescheduleTriggers(triggers: List<Trigger>, scheduler: Scheduler) {
        triggers.forEach {
            if (!scheduler.checkExists(it.key)) {
                scheduler.scheduleJob(it)
            } else {
                scheduler.rescheduleJob(it.key, it)
            }
        }
    }
}

Add a listener to the scheduler:

@Component
@Profile("quartz")
class JobListenerConfig(
    private val schedulerFactory: SchedulerFactoryBean,
    private val jobListener: JobListener
) {
    @PostConstruct
    fun addListener() {
        schedulerFactory.scheduler.listenerManager.addJobListener(jobListener, KeyMatcher.keyEquals(jobKey("SomeJob")))
    }
}

And now the most important part: the listener logic that handles the result of our job's execution:

@Profile("quartz")
class JobListener(
    //can be obtained from the execution context, but it can also be injected
    private val scheduler: Scheduler,
    private val triggers: List<Trigger>
): JobListenerSupport() {

    private lateinit var triggerCronMap: Map<String, String>

    @PostConstruct
    fun post(){
        //recovery triggers are not included here, only the triggers we defined ourselves
        triggerCronMap = triggers.associate {
            it.key.name to (it as CronTrigger).cronExpression
        }
    }

    override fun getName(): String {
        return "myJobListener"
    }


    override fun jobToBeExecuted(context: JobExecutionContext) {
        log.info("Job: ${context.jobDetail.key.name} ready to start by trigger: ${context.trigger.key.name}")
    }


    override fun jobWasExecuted(context: JobExecutionContext, jobException: JobExecutionException?) {
        //you can use context.mergedJobDataMap
        val dataMap = context.trigger.jobDataMap
        val count = if (dataMap["count"] != null) dataMap.getIntValue("count") else {
            dataMap.putAsString("count", 1)
            1
        }
        //in the if block, you can add the condition && !context.trigger.key.name.startsWith("recover_"); in that case, the scheduler will not restart recovery triggers if they fail during execution
        if (jobException != null) {
            if (count < 3) {
                log.warn("Job: ${context.jobDetail.key.name} failed during execution. Restart attempts count: $count ")
                val oldTrigger = context.trigger
                var newTriggerName = context.trigger.key.name + "_retry"
                //in case such a trigger already exists
                context.scheduler.getTriggersOfJob(context.jobDetail.key)
                    .map { it.key.name }
                    .takeIf { it.contains(newTriggerName) }
                    ?.apply { newTriggerName += "_retry" }
                val newTrigger = TriggerBuilder.newTrigger()
                    .forJob(context.jobDetail)
                    .withIdentity(newTriggerName, context.trigger.key.group)
                    //create a simple trigger that should be fired in 5 minutes * restart attempts
                    .startAt(Date.from(Instant.now().plus((5 * count).toLong(), ChronoUnit.MINUTES)))
                    .usingJobData("count", count + 1 )
                    .build()
                val date = scheduler.rescheduleJob(oldTrigger.key, newTrigger)
                log.warn("Rescheduling trigger: ${oldTrigger.key} to trigger: ${newTrigger.key}")
            } else {
                log.warn("The maximum number of restarts has been reached. Restart attempts: $count")
                recheduleWithDefaultTrigger(context)
            }
        } else if (count > 1) {
            recheduleWithDefaultTrigger(context)
        }
        else {
            log.info("Job: ${context.jobDetail.key.name} completed successfully")
        }
        context.scheduler.getTriggersOfJob(context.trigger.jobKey).forEach {
            log.info("Trigger with key: ${it.key} for job: ${context.trigger.jobKey.name} will start at ${it.nextFireTime ?: it.startTime}")
        }
    }

    private fun recheduleWithDefaultTrigger(context: JobExecutionContext) {
        val clone = context.jobDetail.clone() as JobDetail
        val defaultTriggerName = context.trigger.key.name.split("_")[0]
        //Recovery triggers should not be rescheduled
        if (!triggerCronMap.contains(defaultTriggerName)) {
            log.warn("This trigger: ${context.trigger.key.name} for job: ${context.trigger.jobKey.name} is not a self-written trigger. It may be a recovery trigger or something else. This trigger must not be rescheduled.")
            return
        }
        log.warn("Remove all triggers for job: ${context.trigger.jobKey.name} and schedule default trigger for it: $defaultTriggerName")
        scheduler.deleteJob(clone.key)
        scheduler.addJob(clone, true)
        scheduler.scheduleJob(
            TriggerBuilder.newTrigger()
                .forJob(clone)
                .withIdentity(defaultTriggerName)
                .withSchedule(CronScheduleBuilder.cronSchedule(triggerCronMap[defaultTriggerName]))
                .usingJobData("count", 1)
                .startAt(Date.from(Instant.now().plusSeconds(5)))
                .build()
        )
    }
}
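
One caveat about the listener above: split("_")[0] recovers the default trigger name only if that name contains no underscore of its own. As a hedged alternative (written in Java, with hypothetical names TriggerNames and baseName), the following sketch strips only the trailing "_retry" suffixes that the listener appends.

```java
// Hypothetical helper: recovers the original trigger name from a retry trigger name.
class TriggerNames {
    static final String RETRY_SUFFIX = "_retry"; // the suffix appended by the listener above

    // Removes every trailing "_retry" but leaves other underscores untouched.
    static String baseName(String triggerName) {
        String name = triggerName;
        while (name.endsWith(RETRY_SUFFIX) && name.length() > RETRY_SUFFIX.length()) {
            name = name.substring(0, name.length() - RETRY_SUFFIX.length());
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(baseName("SomeJobTrigger_retry_retry"));
        System.out.println(baseName("nightly_sync_retry"));
    }
}
```

For a name like "nightly_sync_retry", split("_")[0] would yield "nightly", which is not a key in triggerCronMap, so the default trigger would never be restored; stripping suffixes avoids that case.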

Last but not least: application.yaml

spring:
  quartz:
    job-store-type: jdbc #Database Mode
    jdbc:
      initialize-schema: never #Do not initialize table structure
    properties:
      org:
        quartz:
          scheduler:
            instanceId: AUTO #AUTO generates the instance ID from the hostname and a timestamp. It can be any string, but must be unique for each scheduler in the cluster (see the INSTANCE_NAME column of qrtz_scheduler_state)
            #instanceName: clusteredScheduler #quartzScheduler
          jobStore:
#            a few problems with the two properties below: https://github.com/spring-projects/spring-boot/issues/28758#issuecomment-974628989 & https://github.com/quartz-scheduler/quartz/issues/284
#            class: org.springframework.scheduling.quartz.LocalDataSourceJobStore #Persistence Configuration
            driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate #Database-specific driver delegate (PostgreSQL here)
#            useProperties: true #Tells the JDBC JobStore that all values in JobDataMaps are strings, so they can be stored as name-value pairs rather than serialized into BLOB columns. In the long run this is safer, because you avoid the class versioning problems of serializing non-String classes into a BLOB.
            tablePrefix: scam_quartz.QRTZ_  #Database Table Prefix
            misfireThreshold: 60000 #The number of milliseconds the scheduler will tolerate a trigger passing its next fire time before it is considered misfired. The default value (if you do not set this property) is 60000 (60 seconds).
            clusterCheckinInterval: 5000 #How often (in milliseconds) this instance checks in with the other instances of the cluster. Affects how quickly failed instances are detected.
            isClustered: true #Turn on Clustering
          threadPool: #Thread pool
            class: org.quartz.simpl.SimpleThreadPool
            threadCount: 3
            threadPriority: 1
            threadsInheritContextClassLoaderOfInitializingThread: true

Here are the official scripts for the database (apply them with Liquibase or Flyway).
More information:
About quartz
spring boot using quartz in cluster mode
One more article
Cluster effectively quartz
