I have a server application which, somewhat simplified, periodically fetches measurements via a REST API from an underpowered server. The values should be cached locally (they are timestamped and immutable), perhaps stored in a FloatBuffer where every position corresponds to one measurement sample. There is also a web browser application which periodically makes AJAX requests to update some neat statistics on the webpage, like in this picture:
Assuming that my server is up and running, there are still many places where errors could occur:
- The REST measurement server could be unreachable (in which case it just keeps storing its measurements locally)
- The network connection to the measurement server could be down
- The storage could be full or somehow corrupt
- The browser could lose contact with the server and try to reconnect
My strategy for coping with errors in general should be the following:
If there are problems getting values from the measurement service via REST, there should be a retry every minute. If the error persists for more than 30 consecutive minutes, the administrator should be notified. In case of disk problems the administrator should be notified at once, or preferably even before the disk fills up.
The end user should be shielded from the errors as much as possible, but the application should still behave as sanely as it can: it should notify the user that an error has occurred while still showing the latest data available.
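To make the strategy concrete, here is a minimal sketch of how I picture the bookkeeping, written as pure functions over a small state map. All names (`initial-state`, `on-failure`, `on-success`, `disk-low?`) and the shape of the map are made up for illustration, and the timestamps are passed in as plain milliseconds; this is an assumption about how it could look, not working code from my application.

```clojure
;; Hypothetical sketch: names and state layout are illustrative assumptions.

(def notify-threshold-ms (* 30 60 1000)) ; 30 minutes of consecutive failure

(def initial-state
  {:failing-since nil    ; time (ms) of the first failure in the current streak
   :last-error    nil    ; most recent error seen in the current streak
   :notified?     false  ; has the administrator been told about this streak?
   :notify-now?   false}) ; true only on the update that crosses the threshold

(defn on-success
  "A successful fetch ends any failure streak."
  [_state]
  initial-state)

(defn on-failure
  "Records a failed fetch at time now-ms and returns the new state.
  :notify-now? is true exactly once per streak, on the first failure
  that happens more than 30 minutes after the streak began."
  [state now-ms error]
  (let [since       (or (:failing-since state) now-ms)
        overdue?    (> (- now-ms since) notify-threshold-ms)
        notify-now? (and overdue? (not (:notified? state)))]
    (assoc state
           :failing-since since
           :last-error    error
           :notified?     (or (:notified? state) overdue?)
           :notify-now?   notify-now?)))

(defn disk-low?
  "True when the partition holding path has fewer than min-free-bytes
  usable bytes left, so a warning can go out before the disk is full."
  [path min-free-bytes]
  (< (.getUsableSpace (java.io.File. path)) min-free-bytes))
```

The idea would be to keep this state in the agent: the scheduled job would `send` either `on-success` or `on-failure` (with the current time and the exception), and something observing the agent, for example a watch added with `add-watch`, would mail the administrator when `:notify-now?` is true. The one-minute retry interval would stay in the scheduler rather than in this state.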
How do I find out which errors I need to cope with, both for network problems (I use clj-http via an agent, triggered by a ScheduledThreadPoolExecutor job, to make the REST requests) and for disk problems when trying to flush the FloatBuffer?
What is a sane way to implement the quite stateful yet algorithmic strategy mentioned above? Should I simply handle the error when the agent reports it and switch to some kind of recovery-mode job?