How to maintain the sparse index in an LSM-tree?

Question

In Designing Data-Intensive Applications, Martin introduces a data structure called the LSM-tree.

It has three main parts: an in-memory memtable (usually a red-black tree), an in-memory sparse index, and on-disk SSTables (aka segments). They work together like this:

When a write happens, it first goes to the memtable. When the memtable becomes full, all of its data is flushed to a new segment, with the keys sorted.
When a read happens, it first looks up the memtable. If the key isn't there, it consults the sparse index to learn which segment the key may reside in. See figure 1.
Periodically, compaction merges multiple segments into one. See figure 2.
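
The flush and compaction steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not how any particular database implements it: segments are modelled as sorted Python lists instead of on-disk files, deletions (tombstones) are ignored, and the names `flush` and `compact` are hypothetical.

```python
import heapq

def flush(memtable):
    """Turn an in-memory dict into a sorted segment: a list of (key, value)."""
    return sorted(memtable.items())

def compact(segments):
    """Merge sorted segments into one. segments[0] is oldest, segments[-1] newest."""
    # Tag each entry with -age so that, for equal keys, the entry from the
    # newest segment sorts first (and values are never compared).
    tagged = (
        [(key, -age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first hit per key is the newest value
    return merged

old = flush({"a": 1, "c": 3})
new = flush({"a": 10, "b": 2})
print(compact([old, new]))
```

Because every input segment is already sorted, the merge is a single sequential pass, and the output segment is sorted as well.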

As figure 2 shows, keys are sorted within a segment but NOT across segments. This makes me wonder: how do we maintain the sparse index such that keys in the index have increasing offsets?

Martin Kleppmann · Accepted Answer · 2021-09-08T14:45:33.960

A typical approach is to have a separate index per segment file, and this index is re-generated during compaction/merging of segment files. When reading a key, we then have to check multiple current segment files that may contain the key, and return the value that appears in the most recent of those segments.
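
As a rough sketch of this idea, assuming each segment fits in a sorted Python list standing in for the on-disk file: the class `Segment`, the function `read`, and the `stride` parameter are hypothetical names chosen for illustration, and deletions are not handled.

```python
import bisect

class Segment:
    """A sorted run of (key, value) pairs with its own sparse index.

    Assumes keys are unique within a segment.
    """

    def __init__(self, entries, stride=2):
        self.entries = sorted(entries)  # stands in for the on-disk file
        self.stride = stride
        # Sparse index: every `stride`-th key and its position in the file.
        self.index_keys = [k for k, _ in self.entries[::stride]]
        self.index_pos = list(range(0, len(self.entries), stride))

    def get(self, key):
        # Find the last indexed key <= key, then scan one block from there.
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        start = self.index_pos[i]
        for k, v in self.entries[start:start + self.stride]:
            if k == key:
                return v
        return None

def read(segments, key):
    """segments is ordered oldest to newest; the most recent segment wins."""
    for seg in reversed(segments):
        value = seg.get(key)
        if value is not None:
            return value
    return None

old = Segment([("a", 1), ("b", 2), ("c", 3)])
new = Segment([("b", 20), ("d", 4)])
print(read([old, new], "b"), read([old, new], "c"))
```

Since the index is built from a single segment's sorted contents, its offsets are increasing by construction, and it can simply be rebuilt whenever compaction writes a new segment.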

It's not possible to tell just from looking at the index whether a particular segment contains a particular key. To avoid having to do a disk read for every segment, a common optimisation is to have a Bloom filter (or similar data structure such as a Cuckoo filter) for each segment that summarises the keys contained within that segment. That allows the read operation to only make a disk read for those segments that actually contain the desired key (with a small probability of making unnecessary disk reads due to Bloom filter false positives).
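
A per-segment Bloom filter could look roughly like the sketch below. It is a toy version for illustration only: the sizes, the use of SHA-256 with a seed prefix to derive several hash positions, and the names `BloomFilter`, `might_contain` and `segments_to_check` are all assumptions, not taken from any particular storage engine.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

def segments_to_check(filters, key):
    """Return the indexes of segments whose filter does not rule out the key."""
    return [i for i, f in enumerate(filters) if f.might_contain(key)]

empty = BloomFilter()
fruit = BloomFilter()
fruit.add("apple")
print(segments_to_check([empty, fruit], "apple"))
```

A read would then consult the sparse index and the file only for the segments this returns, newest first.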

edited Sep 08 '21 at 14:45

answered Sep 08 '21 at 13:22


If there is a separate index per segment, it does speed up the in-segment lookup (binary search vs scan) if we know the key is in that segment. However, how can the sparse indexes be used to locate which segment files may contain the key? – yiksanchan Sep 08 '21 at 13:38
Or does it require a scan of all per-segment sparse indexes? – yiksanchan Sep 08 '21 at 14:20
I edited the answer to respond to your question. – Martin Kleppmann Sep 08 '21 at 14:46

So each segment will have a BloomFilter sort of thing as well as a sparse index. The former one helps to locate "which segment file may contain the key" easily, and the latter one helps to locate the key inside the segment file if it may exist. Is that correct? – yiksanchan Sep 09 '21 at 01:33
