
Implementation-wise, how exactly does the memtable (in Cassandra, RocksDB, LevelDB, or any LSM tree) flush to an SSTable?

I get that a memtable is some sorted data structure, like a red-black tree, but how do we turn that into a file of sorted key/value pairs? Do we iterate through the tree from the smallest key to the largest key in a for-loop and insert the data one by one into a memory buffer (in SSTable format), and then write that to disk? Do we use some sort of tree-serialization method (if so, how is that still in SSTable format)? Could we just use a min-heap for the memtable and, when flushing, keep popping the min element and appending it to the array we flush?

I'm trying to understand the super specific details. I was looking at this file but was having a hard time understanding it: https://github.com/facebook/rocksdb/blob/fbfcf5cbcd3b09b6de0924d3c52a744a626135c0/db/flush_job.cc

squidword

2 Answers


You are correct.
The memtable is iterated over from smallest key to largest and written out to a file.

In practice, other things are written to the file as well, such as Bloom filters, sparse seek indices, and other metadata (entry count, min key, max key), but the foundation of the file is the section containing all the keys that were previously in the memtable.

You don't need a min-heap, because the data is already sorted in the skiplist.
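To make the iterate-and-write idea concrete, here is a minimal sketch in Python of flushing an already-sorted memtable to an SSTable-like file: an in-order data section, followed by a footer holding a sparse index and metadata. This is illustrative only, not RocksDB's actual on-disk format; all names and the layout are assumptions.

```python
import json

def flush_memtable(memtable, path, index_every=2):
    """Flush a sorted memtable to a simple SSTable-like file.

    `memtable` is any mapping; a real memtable (skiplist) would already
    yield keys in order, so the sorted() call below would be unnecessary.
    Illustrative sketch only, not RocksDB's real format.
    """
    sparse_index = {}                       # sampled key -> byte offset
    entries = sorted(memtable.items())      # skiplist iteration is already sorted
    with open(path, "wb") as f:
        # 1. Data section: iterate smallest key to largest, append records.
        for i, (key, value) in enumerate(entries):
            if i % index_every == 0:        # sample every Nth key for the index
                sparse_index[key] = f.tell()
            f.write(f"{key}\t{value}\n".encode())
        # 2. Footer: sparse index plus metadata (count, min/max key).
        footer_offset = f.tell()
        footer = {
            "index": sparse_index,
            "count": len(entries),
            "min_key": entries[0][0] if entries else None,
            "max_key": entries[-1][0] if entries else None,
        }
        f.write(json.dumps(footer).encode())
        # 3. Trailer: fixed-width pointer so a reader can locate the footer.
        f.write(footer_offset.to_bytes(8, "big"))
    return footer
```

A reader would seek to the last 8 bytes, find the footer, and use the sparse index to binary-search into the data section without scanning the whole file.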

Asad Awadia
  • Do we necessarily need to write the Bloom filter/sparse index to the same file? Say I wanted to store the segment data itself on a different storage drive (HDD) and the metadata on an SSD (in theory). – squidword Aug 08 '22 at 21:04
  • No, it is not necessary for them to be in the same file; you can structure things however you want. If they are together, you don't have to keep a mapping of where they are, since they are right there in the file. – Asad Awadia Aug 08 '22 at 22:53
  • I had a similar issue. What I'm doing: I have two threads. One accepts writes and inserts into the memtable (a skiplist). Once the skiplist reaches a set number of keys, say 100K, I push the memtable onto a `concurrentLinkedList`, from which another thread polls and writes it to a file. The problem is that the writing thread is so slow (because of index serialization, data, etc.) that my queue accumulates hundreds of pending memtables in memory, and I start getting OOM errors. Any solution to this kind of problem? – voila Aug 16 '22 at 11:47
  • Cap the number of active memtables there can be at any time. Simply stated, if the consumer is slow, the producer needs to slow down putting memtables into the list. RocksDB caps memtables at two: one active and one being flushed. – Asad Awadia Aug 16 '22 at 13:43
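The backpressure idea from the last comment can be sketched with a bounded queue: once the cap is reached, the producing thread blocks instead of piling up memtables. This is a sketch of the general pattern, not RocksDB's implementation; the cap value and all names are illustrative.

```python
import queue
import threading

# A bounded queue blocks the producer once `maxsize` memtables are
# pending, applying backpressure instead of growing without limit.
# RocksDB similarly caps the number of memtables (illustrative sketch).
MAX_PENDING_MEMTABLES = 2
pending = queue.Queue(maxsize=MAX_PENDING_MEMTABLES)

flushed = []  # stands in for SSTable files written to disk

def flusher():
    while True:
        memtable = pending.get()     # blocks until a memtable is ready
        if memtable is None:         # sentinel: shut down the thread
            break
        flushed.append(sorted(memtable.items()))  # stand-in for a real flush

writer = threading.Thread(target=flusher, daemon=True)
writer.start()

# Write path: put() blocks when the queue is full, so the producer
# naturally slows down instead of exhausting memory (no OOM).
for batch in range(5):
    memtable = {f"k{batch}-{i}": i for i in range(3)}
    pending.put(memtable)            # blocks if two flushes are pending

pending.put(None)                    # signal shutdown
writer.join()
```

With `maxsize=2` the write path stalls whenever one memtable is active and one is being flushed, mirroring the two-memtable cap described above.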

RocksDB's default memtable is implemented as a skiplist, which is a linked-list structure that supports binary-search-style lookups, similar in spirit to a B+ tree. When writing out an SST file, it iterates over all the keys in sorted order.

Siying Dong