I have a Kotlin scheduling config file below. It has an important task scheduled to run at 11am each Monday.

What do I need to do to build in resiliency or retry attempts in case the service is down at 11am?

Can these Spring Boot and Kotlin @Scheduled jobs be configured for enterprise-level resiliency, or do I need to use something like Kubernetes CronJobs to achieve this?

I am also looking into Spring Boot Quartz scheduler with a JobStore as an option. Any alternative setup suggestions are welcome.

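import org.slf4j.LoggerFactory
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

@Component
class CronConfig {

    private val logger = LoggerFactory.getLogger(CronConfig::class.java)

    // Run Monday morning @ 11am (fields: second minute hour day-of-month month day-of-week)
    @Scheduled(cron = "0 0 11 * * MON")
    fun doSomething() {
        logger.info("Doing something")
    }
}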
William Ross

2 Answers

It's good that you're thinking about what might go wrong. (Too often we developers assume everything will go right, and don't consider and handle all the ways our code could fail!)

Unfortunately, I don't think there's a standard practice for this; the right approach probably depends on your exact situation.

Perhaps the simplest approach is just to ensure that your function cannot fail, by doing error-handling, and if needed waiting and retrying, within it. You could split the actual processing out to a separate method if that makes it more readable, e.g.:

import java.util.concurrent.TimeUnit

@Scheduled(cron = "0 0 11 * * MON")
fun doSomethingDriver() {
    while (true) { // Keep trying until successful…
        try {
            doSomething()
            return // It worked!
        } catch (x: Exception) {
            logger.error("Can't doSomething: {}.  Will retry…", x.message)
            TimeUnit.SECONDS.sleep(10L) // Blocks the scheduler thread while waiting
        }
    }
}

fun doSomething() {
    logger.info("Doing something")
    // …
}

That's pretty straightforward.  One disadvantage is that it keeps the thread waiting between retries; since Spring uses a single-threaded scheduler by default (see these questions), that means it could delay any other scheduled jobs.
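If retrying forever is a concern, a bounded variant of the same idea might look something like this. It's only a sketch: `retryWithBackoff` and its parameters are names I've made up, not anything from Spring, and the doubling delay is just one reasonable choice.

```kotlin
import java.time.Duration

/**
 * Runs [task] up to [maxAttempts] times, waiting between failures.
 * Returns true on the first success, false once all attempts are used up.
 * The wait doubles after each failure, capped at [maxDelay].
 * [sleep] is injectable so the waiting can be replaced (e.g. in tests).
 */
fun retryWithBackoff(
    maxAttempts: Int,
    initialDelay: Duration,
    maxDelay: Duration,
    sleep: (Duration) -> Unit = { Thread.sleep(it.toMillis()) },
    task: () -> Unit,
): Boolean {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delay = initialDelay
    for (attempt in 1..maxAttempts) {
        try {
            task()
            return true
        } catch (e: Exception) {
            if (attempt == maxAttempts) return false // Give up; the caller decides what to do
            sleep(delay)
            delay = minOf(delay.multipliedBy(2), maxDelay)
        }
    }
    return false
}
```

The scheduled function would then call something like `retryWithBackoff(5, Duration.ofSeconds(10), Duration.ofMinutes(5)) { doSomething() }` and log or alert if it returns false. It still blocks the calling thread while it waits.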

Alternatively, if your scheduled function doesn't keep retrying, then you'll need some other way to trigger a retry.

You could ‘poll’: store the time of the last successful run, change the scheduled function to run much more frequently, and have it check whether another run is needed (i.e. whether there's been no successful run since the last 11am Monday). This will be more complex — especially as it needs to maintain state and do date/time processing. (You shouldn't need to worry about concurrency, though, unless you've made the function @Async or set up your own scheduling config.) It's also a little less efficient, due to all the extra scheduled wake-ups.
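The date/time part of the polling approach could be sketched roughly as follows. The function names are hypothetical, and where `lastSuccess` is stored is left open (it would need to live somewhere that survives restarts).

```kotlin
import java.time.DayOfWeek
import java.time.Instant
import java.time.LocalTime
import java.time.ZonedDateTime
import java.time.temporal.TemporalAdjusters

/** Most recent Monday 11:00 (in the zone of [now]) that is not after [now]. */
fun latestScheduledSlot(now: ZonedDateTime): ZonedDateTime {
    val candidate = now
        .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
        .with(LocalTime.of(11, 0))
    // On a Monday before 11:00 the candidate is still in the future, so step back a week.
    return if (candidate.isAfter(now)) candidate.minusWeeks(1) else candidate
}

/** True when no successful run has been recorded since the latest scheduled slot. */
fun isRunDue(lastSuccess: Instant?, now: ZonedDateTime): Boolean {
    val slot = latestScheduledSlot(now).toInstant()
    return lastSuccess == null || lastSuccess.isBefore(slot)
}
```

The frequent scheduled function would call `isRunDue(lastSuccess, ZonedDateTime.now())`, run the task if it returns true, and record the new success time only once the task has finished without error.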

Or you could trap errors (like the code above) but instead of waiting and retrying, manually schedule a retry for a future time, e.g. using your own TaskExecutor. This would also be more complex.
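A rough sketch of that last idea, using a plain JDK `ScheduledExecutorService` as a stand-in (Spring's own TaskScheduler could presumably play the same role). `DeferredRetry`, `execute` and `onGiveUp` are names I've invented for illustration.

```kotlin
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit

/**
 * Runs [task]; if it throws, hands the next attempt to [executor] instead of
 * sleeping, so the calling (scheduler) thread is released straight away.
 */
class DeferredRetry(
    private val executor: ScheduledExecutorService,
    private val maxAttempts: Int,
    private val delayMillis: Long,
    private val onGiveUp: (Exception) -> Unit,
    private val task: () -> Unit,
) {
    /** Runs the task once; on failure, queues the next attempt and returns immediately. */
    fun execute(attempt: Int = 1) {
        try {
            task()
        } catch (e: Exception) {
            if (attempt >= maxAttempts) {
                onGiveUp(e) // e.g. log at error level or raise an alert
            } else {
                executor.schedule(Runnable { execute(attempt + 1) }, delayMillis, TimeUnit.MILLISECONDS)
            }
        }
    }
}
```

The scheduled function would just call `execute()` and return. Note that pending retries live only in memory, so they are lost if the process stops before they fire.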

gidds
  • Thank you for the very informative and helpful reply! I'm now debating between putting this type of failure checking logic into the Kotlin code, or setting up a Kubernetes CronJob to call a script which calls `doSomething`. Do you know if it's a more standard approach with Kotlin to have the Exception catching/TaskExecutor handle the retry or do that on the infra side with Kubernetes? – William Ross Jun 01 '23 at 19:45
  • Sorry, never worked with anything like Kubernetes… – gidds Jun 01 '23 at 21:02

If you want to ensure you never miss that Monday task, my experience is that only some of the ways your solution may fail are caused by the code itself, so confining the solution to try/catch/retry will miss wider causes, e.g.:

  • the running environment runs out of some resource (disc, memory) and the service is not alive at the time the Cron schedule tries to run. (Kubernetes will generally help minimise these cases, I grant you).
  • someone chooses just that special time to deploy a new container, so the Cron schedule is missed
  • you later evolve the service so you have more than one instance of the process so you get multiple executions

As great as Kubernetes is, once the Pod restarts, you typically lose the log files so you cannot easily know what happened and whether your important process ran.

For these cases I suggest two approaches. (One of these matches @gidds's suggestion.)

1. Maintain state outside the application in a trusted backing store.

The application keeps its @Scheduled method to run at the nominated time, but on startup it also looks up a nextRunAt datetime in the external store. If a run has been missed, it is then easy for both that process and humans to know and take action. You can have Spring call a method on startup in this way:

@Bean
fun startUp() = CommandLineRunner {
    ...
}

Of course, the process needs to update the nextRunAt.
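To make the nextRunAt idea a little more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `RunStateStore` stands in for whatever backing store you trust, the in-memory version exists only so the sketch runs, and treating an empty store as "first deployment, nothing to catch up on" is just one possible policy.

```kotlin
import java.time.Instant

/** Hypothetical abstraction over the external store (database row, key-value entry, ...). */
interface RunStateStore {
    fun loadNextRunAt(): Instant?
    fun saveNextRunAt(value: Instant)
}

/** In-memory stand-in so the sketch runs; it does NOT survive restarts. */
class InMemoryRunStateStore(private var nextRunAt: Instant? = null) : RunStateStore {
    override fun loadNextRunAt(): Instant? = nextRunAt
    override fun saveNextRunAt(value: Instant) { nextRunAt = value }
}

class CatchUpJob(
    private val store: RunStateStore,
    private val nextSlotAfter: (Instant) -> Instant, // e.g. "next Monday 11:00 after this instant"
    private val task: () -> Unit,
) {
    /** Call from both the @Scheduled method and the startup hook. Returns true if the task ran. */
    fun runIfDue(now: Instant): Boolean {
        val nextRunAt = store.loadNextRunAt()
        if (nextRunAt == null) {
            // Assumed policy: first deployment, so just record the next slot.
            store.saveNextRunAt(nextSlotAfter(now))
            return false
        }
        if (now.isBefore(nextRunAt)) return false
        task() // If this throws, nextRunAt stays in the past and the run remains visibly overdue.
        store.saveNextRunAt(nextSlotAfter(now))
        return true
    }
}
```

This does not guard against two instances running the task at the same moment; that would need locking or a conditional update in the store.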

2. Use a messaging system and simple scheduler

This more complex solution depends on what other infrastructure you have in your mix. If you have a resilient message queuing system and use transactional messaging correctly, a "command" message is placed on a Queue at the run time§. One or more worker nodes subscribe to this Queue. The first to acquire the message processes it, and that worker must properly acknowledge the message as processed. If it does not (e.g. the worker's processing thread dies, or the whole JVM dies), the Queue Manager will offer the message to another subscriber after a suitable timeout. You need to choose that timeout carefully so you don't get a double execution just because the first worker is still running. This approach works even if you only ever intend to have one worker: as soon as it comes back online, the message is there for it and the process will run.
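The acknowledge-or-redeliver behaviour can be illustrated with a toy in-memory queue. This is not a real broker or any particular product's API; `AckQueue` and its methods are invented purely to show the mechanism, and time is passed in explicitly so the behaviour is easy to follow.

```kotlin
import java.time.Duration
import java.time.Instant

/** Toy in-memory queue illustrating acknowledgement and redelivery; not a real broker. */
class AckQueue(private val redeliveryTimeout: Duration) {
    private data class Entry(val id: Int, val body: String, var hiddenUntil: Instant)

    private val entries = mutableListOf<Entry>()
    private var nextId = 1

    /** Adds a message that is immediately available to receivers. */
    fun publish(body: String, now: Instant): Int {
        val id = nextId++
        entries += Entry(id, body, now)
        return id
    }

    /** Hands out the first available message and hides it until the timeout elapses. */
    fun receive(now: Instant): Pair<Int, String>? {
        val entry = entries.firstOrNull { !it.hiddenUntil.isAfter(now) } ?: return null
        entry.hiddenUntil = now.plus(redeliveryTimeout)
        return entry.id to entry.body
    }

    /** Acknowledging removes the message for good; unacknowledged messages reappear. */
    fun ack(id: Int) {
        entries.removeAll { it.id == id }
    }
}
```

A worker that calls `receive` and then dies without calling `ack` leaves the message in place, so the next `receive` after the timeout returns it again. A worker that is merely slow would trigger the same redelivery, which is why the timeout needs care.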

Most Queue Managers will have a management interface where you can see if there is a message waiting.

§ Of course, you still need a process to place the message on the queue at the right time. The Queue approach gives you a very resilient processing solution, BUT there is still a single point of failure: the scheduler. So this part should use the simplest technology you can get hold of, one you can reasonably argue has a very low chance of failure.

That "command" message can be just a blank message in a particular Queue; that's enough. Most Queue systems have an HTTP entry point to create a simple message, so you can imagine:

  1. a Kubernetes CronJob (the Kube people have made this reliable)
  2. that calls a shell script (easy to reason this won't fail)
  3. that uses curl to use HTTP to publish a message on a Queue (this too should be easy enough to be sure this won't fail)
  4. The Queue system won't lose your message - that's its job!
AndrewL
  • Appreciate the detailed response! Was curious if you know anything about the Quartz Scheduler, https://www.quartz-scheduler.org/ Some of the official docs recommend this library to use it for storage. My current understanding is it uses the database as backup for when jobs fail, probably similar to what you mentioned above with `nextRunAt` – William Ross Jun 07 '23 at 13:23
  • No, I don't know. I've only used the embedded Scheduler in Spring. – AndrewL Jun 08 '23 at 00:10