
Write-ahead logging (WAL) is used in many systems.

The mechanism of a WAL is that when a client writes data, the system does two things:

  1. Write a log to disk and return to the client
  2. Write the data to disk, cache or memory asynchronously (see the sketch below)
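For concreteness, here is a minimal sketch of that two-step write path in Go; the KV type and field names are invented for illustration and not taken from any particular system:

    // Minimal sketch of the WAL write path described above (illustrative names).
    package main

    import "os"

    type KV struct {
        wal   *os.File          // append-only log file
        cache map[string]string // in-memory copy of the data
    }

    // Put appends the mutation to the WAL, fsyncs, and only then acknowledges.
    // The data itself only goes to memory here; flushing it into the data files
    // happens later, asynchronously.
    func (kv *KV) Put(key, value string) error {
        if _, err := kv.wal.WriteString(key + "=" + value + "\n"); err != nil {
            return err
        }
        if err := kv.wal.Sync(); err != nil { // step 1: the log record is durable
            return err
        }
        kv.cache[key] = value // step 2: apply in memory; the disk write is deferred
        return nil            // returning success here is the "return to the client"
    }

    func main() {
        f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            panic(err)
        }
        kv := &KV{wal: f, cache: map[string]string{}}
        if err := kv.Put("answer", "42"); err != nil {
            panic(err)
        }
    }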

There are two benefits:

  1. If some exception occurs (e.g. power loss), we can recover the data from the log.
  2. Performance is good because we write data asynchronously and can batch operations.

Why not just write the data to disk directly? Make every write go straight to disk: on success you tell the client it succeeded, and if the write fails you return a failure response or time out (see the sketch below).

In this way, you still have those two benefits.

  1. You do not need to recover anything after a power loss, because every success response returned to the client means the data is really on disk.
  2. Performance should be the same. Although we touch the disk frequently, the WAL does too (every successful WAL write means the write has reached the disk).
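Concretely, that direct-write version might look roughly like the following sketch (again with invented names); it is almost identical to the WAL append above, which is what prompts the question:

    // Hypothetical direct-write Put: every write goes to the data file itself and
    // is fsynced before the client is told it succeeded. In a store with any
    // on-disk structure, this write would usually land at some arbitrary offset
    // rather than at the end of a log.
    func PutDirect(dataFile *os.File, key, value string) error {
        if _, err := dataFile.WriteString(key + "=" + value + "\n"); err != nil {
            return err
        }
        return dataFile.Sync() // only after this do we report success to the client
    }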

So what is the advantage of using a WAL?

chaosaffe
Kramer Li
  • If you have a separate log, then you can restore on another machine much more easily, for the sake of redundancy and scalability – Stephen Collins Nov 04 '19 at 13:16

3 Answers


Performance.

  • Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. Those data writes never need to be performed; only the log writes are needed for possible recovery.

  • Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput (see the sketch below).
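A rough sketch of that delay-and-batch idea (often called group commit); the logRequest type, the channel-based structure and the 1 ms window are all invented for illustration, and the fragment assumes the os and time packages are imported:

    // Group-commit sketch: writers queue log records on a channel; one goroutine
    // gathers a batch, performs a single sequential write plus one fsync, and
    // only then signals completion to every waiting writer.
    type logRequest struct {
        record []byte
        done   chan error // buffered by the caller; receives nil on success
    }

    func logWriter(wal *os.File, requests <-chan logRequest) {
        for first := range requests {
            batch := []logRequest{first}
            window := time.After(1 * time.Millisecond) // briefly wait for more writers
        collect:
            for {
                select {
                case req := <-requests:
                    batch = append(batch, req)
                case <-window:
                    break collect
                }
            }
            var buf []byte
            for _, req := range batch {
                buf = append(buf, req.record...)
            }
            _, err := wal.Write(buf) // one large sequential write
            if err == nil {
                err = wal.Sync() // one fsync for the whole batch
            }
            for _, req := range batch {
                req.done <- err // each commit completes only after the batch is durable
            }
        }
    }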

This was much more important when spinning disks were the standard technology, because seek time and rotational latency were a big issue; these are the physical processes of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes and favouring large sequential writes still helps.

Update:

SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is free (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also has to manage wear levelling and potentially map updates onto different internal block sizes.

janm
  • Hmm. If you delay writes to batch them, doesn't the durability guarantee go out the window? And if so, what is the whole point of a WAL? – Sush Mar 19 '22 at 17:01
  • @Sush No, the durability guarantee is still there because the commit still completes only after the write is complete. For this to be helpful, the workload needs to be high enough. The idea is to improve overall system throughput at the cost of some extra latency for some transactions. – janm Mar 21 '22 at 10:51
  • Ah I see. So just to make my understanding concrete: say the sequential write to the HDD takes 1 ms. Each transaction would then take at least 1 ms to complete (likely longer due to batching), but the throughput of the system would increase, proportional to the size of each batch (limited by the max disk IO write size)… also, the server "holds on" to the client's transaction until its batch is flushed to disk? Fascinating. – Sush Mar 22 '22 at 16:41
  • Additionally, the WAL can be written to a fast device like an NVMe drive or SSD while the data is stored permanently on slower devices. This makes sense in bigger storage setups while staying within budget. – mgabriel Oct 05 '22 at 20:39

As you note, a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.

If you write the update directly to disk, there are two options:

  1. write all records to the end of some file
  2. the files are somehow structured

If you go with 1), it is needless to say that the cost of a read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM tree, whose files are internally sorted by key. For that reason, "directly writing to disk" would mean that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you

  1. write to the WAL for persistence
  2. update the memtables (in RAM)

Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that is just a balanced tree. When you flush the memtable to disk and/or run a compaction, you rewrite the file structures to reflect the result of many writes at once, which makes each individual write substantially cheaper.
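A condensed sketch of that write path, where a sorted slice stands in for the memtable (in practice a balanced tree or skiplist); the names are invented and the fragment assumes the os and sort packages are imported:

    // LSM-style write path sketch: persist to the WAL first, then update the
    // sorted in-memory memtable. No sorted data file is rewritten on the write
    // path; that work is deferred to flushes and compactions.
    type entry struct{ key, value string }

    type DB struct {
        wal      *os.File
        memtable []entry // kept sorted by key
    }

    func (db *DB) Put(key, value string) error {
        // 1. Write to the WAL for persistence.
        if _, err := db.wal.WriteString(key + "=" + value + "\n"); err != nil {
            return err
        }
        if err := db.wal.Sync(); err != nil {
            return err
        }
        // 2. Update the memtable in RAM (cheap: an in-memory sorted insert).
        i := sort.Search(len(db.memtable), func(i int) bool { return db.memtable[i].key >= key })
        if i < len(db.memtable) && db.memtable[i].key == key {
            db.memtable[i].value = value
            return nil
        }
        db.memtable = append(db.memtable, entry{})
        copy(db.memtable[i+1:], db.memtable[i:])
        db.memtable[i] = entry{key, value}
        return nil
    }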

midor

I have a guess.

Making every write go directly to disk does indeed avoid the need for recovery after a power loss. But the performance question needs to be considered for two situations.

Situation 1:

All your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write while writing the data into place on disk is a random write, and random writes are far slower than sequential writes on a spinning disk.

Situation 2: All your devices are SSDs. Then the performance may not differ as much, because sequential and random writes have almost the same performance on an SSD.
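If you want a rough feel for the difference on your own hardware, a crude micro-benchmark sketch along these lines could help; the timeWrites helper is invented, it assumes the os, time and math/rand packages are imported, error handling is omitted for brevity, and results vary enormously by device and filesystem:

    // Crude comparison of sequential appends vs. random positioned writes,
    // fsyncing after each write so the device is actually involved.
    func timeWrites(f *os.File, random bool, n int) time.Duration {
        buf := make([]byte, 4096)
        start := time.Now()
        for i := 0; i < n; i++ {
            if random {
                f.WriteAt(buf, int64(rand.Intn(n))*4096) // write at a random offset
            } else {
                f.Write(buf) // sequential append
            }
            f.Sync()
        }
        return time.Since(start)
    }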

Kramer Li