I've got a system that is basically a serial number generator.
There are two main design requirements of the serial number generator (in terms of fault tolerance):
- All serial numbers issued must be unique (no duplicates)
- If the serial number generator crashes, it needs a way to restart approximately from where it left off (skipping a few serial numbers when recovering from a crash is fine, if necessary, to ensure requirement #1 is upheld)
It can be assumed that serial numbers are issued sequentially (1, 2, 3, 4, etc.)
My first attempt to solve this is for the serial number generator to log every serial number that is issues by appending to a single log file. This way, if it crashes, it simply picks up from the last serial number issued, and goes along it's merry way.
Heres the challenge:
So, what are the pros/cons of the following logging approaches:
- Have a single log file which is appended, and capped at a certain size
- Have two log files, neither of which are appended. The first log file always contains the most recently issued serial number, and the second file contains the 2nd most recently issued serial number. The log files are written to in a see-saw fashion (i.e. all even serial numbers wind up getting put in log file 'A', and odd serial numbers get put in log file 'B'). The end result is that you've always got the last two serial numbers issued (and no more), the use of disk space is tiny, and if the serial number generator happened to crash while logging the most recent serial number, then the 'most recent' file may be corrupted. Since we also have the 2nd most recent serial number in the other log file, we can detect the crash, and restart the serial number generator from this '2nd most recent' serial number, +2
In a crash scenario where the serial number generator cannot tell which of the 'most recent' or '2nd most recent' log file is the one corrupted, it should be safe to always restart a crashed generator from the the non-corrupted serial number +2.
Option 1 is a little simpler to implement, but option 2 uses less disk space, and seems more intelligent.
Am I missing anything in terms of designing a system that can reliably recover from a crash via sufficient log files?