
I want to know whether a read request goes from the higher levels (L3, L4, which have more sstables) to the lower levels (L0, L1, which have fewer sstables) or the other way round.

The problem with read requests going from higher levels to lower levels is that a row in a higher-level sstable may contain obsolete data if the same row in a lower-level sstable has been updated and not yet compacted up to the higher level. Is that understanding correct?

On the other hand, going from lower levels to higher levels won't ensure the 90% guarantee of reading from a single sstable. In most cases, it will read all the levels.

user3545797

2 Answers


It does go from lower levels to higher levels... more or less. A mutation from hinted handoff, read repair, or sstables streamed over from an anti-entropy repair can put old rows in the lower levels, which will mess that up a bit. TWCS handles that a bit better (but still really not great).

It will read at most one sstable per level (except L0, which is compacted with STCS), and walk the sstables in order of their age (which tends to correspond with level). Once it has all the requested columns it won't have to read any older sstables, so it can stop, because it knows that even if there's any data in older sstables it is obsolete and will lose the last-write-wins (LWW) conflict.
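
To make that walk concrete, here is a minimal sketch of the early-exit logic, using hypothetical data structures (not Cassandra's actual internals): sstables are visited newest-first, and the walk stops as soon as every requested column is resolved.

```python
from dataclasses import dataclass, field

@dataclass
class SSTable:
    level: int
    max_timestamp: int
    rows: dict = field(default_factory=dict)   # key -> {column: value}

def read_partition(sstables, key, wanted):
    """Walk sstables newest-first, stopping once every wanted column is resolved."""
    resolved = {}
    for s in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        row = s.rows.get(key)                  # at most one hit per level (L1+)
        if row is None:
            continue
        for col, val in row.items():
            resolved.setdefault(col, val)      # newest copy wins (LWW)
        if wanted <= resolved.keys():
            break                              # older sstables can only hold obsolete data
    return {c: resolved[c] for c in wanted if c in resolved}

# Example: the L0 sstable is newer, so its value for "a" shadows L2's.
tables = [
    SSTable(level=2, max_timestamp=100, rows={"k": {"a": 1, "b": 2}}),
    SSTable(level=0, max_timestamp=200, rows={"k": {"a": 9}}),
]
print(read_partition(tables, "k", {"a", "b"}))  # {'a': 9, 'b': 2}
```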

There are some situations, around counters, unfrozen UDTs, and tombstones, where it will have to read all the sstables though.

The 90% comes from the cases where there are no updates to partitions. Because there's also a bloom filter with a 10% false-positive rate (the default for LCS), roughly 90% of such reads will hit only the one sstable.
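
As a back-of-the-envelope check (my own arithmetic, not from the answer): if the partition really lives in one sstable and each other candidate is rejected only by a bloom filter with a 10% false-positive rate, the chance of touching just that one sstable is 0.9 per other candidate.

```python
# Toy version of the 90% figure: one real sstable plus (n - 1) candidates
# screened by bloom filters with a 10% false-positive rate (LCS default).
fp = 0.10
for n_candidates in (2, 3, 7):
    p_single_sstable = (1 - fp) ** (n_candidates - 1)
    print(n_candidates, round(p_single_sstable, 3))  # 2 -> 0.9, 7 -> 0.531
```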

With update-heavy workloads or wide rows like time series, one sstable in each level will likely have the requested partition, in which case it will have to walk all the levels. For those, it will use the sstables' min/max timestamps and min/max clustering bounds to read only what's necessary. In terms of filtering, the min/max partition and clustering check is actually the first thing done.
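
A sketch of that elimination step, with illustrative field names (not Cassandra's real metadata API): an sstable is dropped from consideration when the queried clustering range falls entirely outside its min/max bounds.

```python
from collections import namedtuple

# Illustrative per-sstable metadata; field names are assumptions.
Meta = namedtuple("Meta", "name min_ck max_ck")

def may_contain(meta, lo, hi):
    """Keep the sstable unless the queried clustering range [lo, hi]
    falls entirely outside its min/max clustering bounds."""
    return not (hi < meta.min_ck or lo > meta.max_ck)

sstables = [Meta("L1-a", 0, 50), Meta("L2-b", 100, 200), Meta("L3-c", 40, 120)]
candidates = [m for m in sstables if may_contain(m, 45, 60)]
print([m.name for m in candidates])  # ['L1-a', 'L3-c']
# Only the surviving candidates go on to the bloom-filter check.
```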

The "sstables per read" metric from nodetool tablehistograms is actually the number of sstables still up for reading after the partition/clustering filtering but before the bloom filter check (since the bloom filter may have to read from disk). So you can use that metric to see how many sstables are actually being considered and incur disk seeks.
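
For example (keyspace and table names are placeholders):

```
nodetool tablehistograms <keyspace> <table>
```

The SSTables column of the output shows, at each percentile, how many sstables a read considered.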

Chris Lohfink
  • That makes a lot of sense if the read requests go from lower levels to higher levels. But as I asked in the question, how does it guarantee 90% of reads from a single sstable, since a lower level has fewer sstables and "sparse" data, and the probability of a read hit in a lower level is very low? In most cases, a read request will need to read N sstables, where N is the number of levels. – user3545797 Mar 29 '17 at 05:43
  • I explained a bit more above since it was too much for a comment – Chris Lohfink Mar 29 '17 at 15:41

Please review this:

How does the Leveled Compaction Strategy ensure 90% of reads are from one sstable

In most cases, it will read all the levels

Only if you get to the state where you have the same key saved in all levels. And this is the worst-case scenario, when you have to read one sstable for each level.

Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). Worst case is bounded at the total number of levels — e.g., 7 for 10TB of data.

http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
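
As a rough check of that "7 for 10TB" figure (my own arithmetic, assuming the 5 MB sstable target from the era of that blog post and 10x growth per level, with L1 holding 10 sstables):

```python
# Count how many 10x-growing levels are needed to hold 10 TB of data,
# starting from an L1 capacity of 10 sstables x 5 MB = 50 MB.
sstable_mb = 5
level_mb = 10 * sstable_mb        # L1 capacity: 50 MB
levels = 1
data_mb = 10 * 1024 * 1024        # 10 TB expressed in MB
while level_mb < data_mb:
    level_mb *= 10
    levels += 1
print(levels)  # 7
```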

nevsv