I have a service that starts an "operation" when you call an API endpoint with some parameters. The endpoint starts a background coroutine using `get_scheduler(request).spawn(coro_fn)`. This coroutine can take, say, an hour to finish. At the end, it updates the operation status in the database to "finished".
The problem is that if the service is restarted or killed while this coroutine is running, the coroutine never finishes and the database is never updated, leaving the operation stuck in the "in_progress" state forever.
I see 2 ways of fixing this:
- Do a graceful shutdown: when the service is stopped or restarted, it first waits for all running operations to finish and only then shuts down. This has an issue, though: as I've said, operations sometimes take a long time to finish, so a restart could theoretically take up to an hour. If new operations are created during that time, it could take even longer. I want restarts to be quick.
- Save operation state to the database and restart unfinished operations when the service starts up. The problem here is that I don't want to redo half the computation. I would like to restore the exact state the coroutine had before the shutdown. Is this possible? I suspect it might not work, since the bytecode for the coroutine could change between restarts, so the saved state would no longer point to the same spot in the code. If not, what other ways of solving this problem are there?
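
To make the second option concrete, here is a minimal sketch of what I imagine as a fallback if the coroutine itself can't be saved: the operation records how many steps it has completed, and on startup the service resumes from that point. Everything here is hypothetical and not my real code: the `operations` table layout, the names `run_operation`, `resume_unfinished` and `do_step`, and the assumption that the work can be split into steps at all. `sqlite3` just stands in for the real database.

```python
import asyncio
import sqlite3

# Hypothetical schema: one row per operation, with the last completed step.
SCHEMA = """
CREATE TABLE IF NOT EXISTS operations (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    total_steps INTEGER NOT NULL,
    done_steps INTEGER NOT NULL DEFAULT 0
)
"""


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    db.commit()
    return db


def create_operation(db, total_steps):
    cur = db.execute(
        "INSERT INTO operations (status, total_steps) VALUES ('in_progress', ?)",
        (total_steps,),
    )
    db.commit()
    return cur.lastrowid


async def do_step(op_id, step):
    # Placeholder for one resumable unit of the real long-running computation.
    await asyncio.sleep(0)


async def run_operation(db, op_id, step_fn=do_step):
    # Start from the last checkpoint instead of from step 0.
    total, done = db.execute(
        "SELECT total_steps, done_steps FROM operations WHERE id = ?", (op_id,)
    ).fetchone()
    for step in range(done, total):
        await step_fn(op_id, step)
        # Checkpoint after every step so a restart repeats at most one step.
        # Note: sqlite3 calls block the event loop; a real service would use
        # an async driver or a thread executor here.
        db.execute(
            "UPDATE operations SET done_steps = ? WHERE id = ?", (step + 1, op_id)
        )
        db.commit()
    db.execute("UPDATE operations SET status = 'finished' WHERE id = ?", (op_id,))
    db.commit()


async def resume_unfinished(db):
    # Meant to be called once at service startup.
    rows = db.execute(
        "SELECT id FROM operations WHERE status = 'in_progress'"
    ).fetchall()
    await asyncio.gather(*(run_operation(db, op_id) for (op_id,) in rows))
    return [op_id for (op_id,) in rows]
```

This only helps if the computation can be cut into steps whose intermediate results can be stored, and if repeating the one step that was interrupted is harmless. It is not the same as restoring the exact coroutine state, which is what I would prefer if it is possible.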