
I want to know whether a read request goes from the higher levels (L3, L4, which have more sstables) to the lower levels (L0, L1, which have fewer sstables) or the other way round.

The problem with read requests going from higher levels to lower levels is that a row in a higher-level sstable may contain obsolete data if the same row in a lower-level sstable has been updated and not yet compacted up to the higher level. Is that understanding correct?

On the other hand, going from lower levels to higher levels won't ensure the 90% guarantee of reading from a single sstable. In most cases, it will read all the levels.

user3545797

2 Answers


It does go from lower levels to higher levels... more or less. A mutation from hinted handoff, read repair, or sstables streamed over from an anti-entropy repair can put old rows in the lower levels, which will mess that up a bit. TWCS handles that a bit better (but still really not great).

It will read at most one sstable per level (except L0, which is compacted with STCS), and walk the sstables in order of their age (which tends to correspond with level). Once it has all the requested columns it won't have to read any older sstables, so it can stop, because it knows that even if there's any data in older sstables it is obsolete and will lose the last-write-wins (LWW) conflict.
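
To make that walk concrete, here is a minimal sketch of the early-exit logic, using hypothetical data structures (not Cassandra's actual internals): sstables are visited newest-first, and the walk stops as soon as every requested column is resolved.

```python
from dataclasses import dataclass, field

@dataclass
class SSTable:
    level: int
    max_timestamp: int
    rows: dict = field(default_factory=dict)   # key -> {column: value}

def read_partition(sstables, key, wanted):
    """Walk sstables newest-first, stopping once every wanted column is resolved."""
    resolved = {}
    for s in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        row = s.rows.get(key)                  # at most one hit per level (L1+)
        if row is None:
            continue
        for col, val in row.items():
            resolved.setdefault(col, val)      # newest copy wins (LWW)
        if wanted <= resolved.keys():
            break                              # older sstables can only hold obsolete data
    return {c: resolved[c] for c in wanted if c in resolved}

# Example: the L0 sstable is newer, so its value for "a" shadows L2's.
tables = [
    SSTable(level=2, max_timestamp=100, rows={"k": {"a": 1, "b": 2}}),
    SSTable(level=0, max_timestamp=200, rows={"k": {"a": 9}}),
]
print(read_partition(tables, "k", {"a", "b"}))  # {'a': 9, 'b': 2}
```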

There are some situations, around counters, unfrozen UDTs, and tombstones, where it will have to read all the sstables though.

The 90% comes from the cases where there are no updates to partitions. Because there's also a bloom filter with a 10% false-positive rate (the default for LCS), roughly 90% of such reads will hit only the one sstable.
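
As a back-of-the-envelope check (my own arithmetic, not from the answer): if the partition really lives in one sstable and each other candidate is rejected only by a bloom filter with a 10% false-positive rate, the chance of touching just that one sstable is 0.9 per other candidate.

```python
# Toy version of the 90% figure: one real sstable plus (n - 1) candidates
# screened by bloom filters with a 10% false-positive rate (LCS default).
fp = 0.10
for n_candidates in (2, 3, 7):
    p_single_sstable = (1 - fp) ** (n_candidates - 1)
    print(n_candidates, round(p_single_sstable, 3))  # 2 -> 0.9, 7 -> 0.531
```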

With update-heavy workloads or wide rows like time series, one sstable in each level will likely have the requested partition, in which case it will have to walk all the levels. For those, it will use the sstables' min/max timestamps and min/max clustering bounds to read only what's necessary. In terms of filtering, the min/max partition and clustering check is actually the first thing done.
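
A sketch of that elimination step, with illustrative field names (not Cassandra's real metadata API): an sstable is dropped from consideration when the queried clustering range falls entirely outside its min/max bounds.

```python
from collections import namedtuple

# Illustrative per-sstable metadata; field names are assumptions.
Meta = namedtuple("Meta", "name min_ck max_ck")

def may_contain(meta, lo, hi):
    """Keep the sstable unless the queried clustering range [lo, hi]
    falls entirely outside its min/max clustering bounds."""
    return not (hi < meta.min_ck or lo > meta.max_ck)

sstables = [Meta("L1-a", 0, 50), Meta("L2-b", 100, 200), Meta("L3-c", 40, 120)]
candidates = [m for m in sstables if may_contain(m, 45, 60)]
print([m.name for m in candidates])  # ['L1-a', 'L3-c']
# Only the surviving candidates go on to the bloom-filter check.
```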

The "sstables per read" metric from nodetool tablehistograms is actually the number of sstables still up for reading after the partition/clustering filtering but before the bloom filter check (since the bloom filter may have to read from disk). So you can use that metric to see how many sstables are actually being considered and incur disk seeks.
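
For example (keyspace and table names are placeholders):

```
nodetool tablehistograms <keyspace> <table>
```

The SSTables column of the output shows, at each percentile, how many sstables a read considered.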

Chris Lohfink
  • That makes a lot of sense if the read requests go from lower levels to higher levels. But as I asked in the question, how does it guarantee 90% of reads from a single sstable, since a lower level has fewer sstables and "sparse" data, and the probability of a read hit in a lower level is very low? In most cases, a read request will need to read N sstables, where N is the number of levels. – user3545797 Mar 29 '17 at 05:43
  • I explained a bit more above since it was too much for a comment – Chris Lohfink Mar 29 '17 at 15:41

Please review this:

How does the Leveled Compaction Strategy ensure 90% of reads are from one sstable

In most cases, it will read all the levels

Only if you get to the state where you have the same key saved in all levels. And this is the worst-case scenario, when you have to read one sstable for each level.

Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). Worst case is bounded at the total number of levels — e.g., 7 for 10TB of data.

http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
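
As a rough check of that "7 for 10TB" figure (my own arithmetic, assuming the 5 MB sstable target from the era of that blog post and 10x growth per level, with L1 holding 10 sstables):

```python
# Count how many 10x-growing levels are needed to hold 10 TB of data,
# starting from an L1 capacity of 10 sstables x 5 MB = 50 MB.
sstable_mb = 5
level_mb = 10 * sstable_mb        # L1 capacity: 50 MB
levels = 1
data_mb = 10 * 1024 * 1024        # 10 TB expressed in MB
while level_mb < data_mb:
    level_mb *= 10
    levels += 1
print(levels)  # 7
```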

nevsv