
According to Cassandra's log (see below), queries are getting aborted because too many tombstones are present. This is happening because once a week I clean up (delete) rows whose counter is too low. This 'deletes' hundreds of thousands of rows (i.e. marks them with a tombstone).

It is not a problem at all if, in this table, a deleted row re-appears because a node was down during the cleanup process, so I set the gc grace time for the single affected table to 10 hours (down from the default of 10 days) so that the tombstoned rows can be permanently deleted relatively fast.
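(For reference, gc_grace_seconds is a per-table setting; I lowered it with something along these lines, where the keyspace and table names are just placeholders for my actual ones.)

    -- placeholder names; 36000 seconds = 10 hours
    ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 36000;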

Regardless, I had to set the tombstone_failure_threshold extremely high (one hundred million, up from one hundred thousand) to avoid the exception below. My question is: is this necessary? I have absolutely no idea what type of queries get aborted: inserts, selects, deletes?

If it's merely some selects being aborted, it's not that big a deal. But that's assuming abort means 'capped' in that the query stops prematurely and returns whatever live data it managed to gather before too many tombstones were found.

To put it more simply: what happens when the tombstone_failure_threshold is exceeded?

INFO [HintedHandoff:36] 2014-02-12 17:44:22,355 HintedHandOffManager.java (line 323) Started hinted handoff for host: fb04ad4c-xxxx-4516-8569-xxxxxxxxx with IP: /XX.XX.XXX.XX
ERROR [HintedHandoff:36] 2014-02-12 17:44:22,667 SliceQueryFilter.java (line 200) Scanned over 100000 tombstones; query aborted (see tombstone_fail_threshold)
ERROR [HintedHandoff:36] 2014-02-12 17:44:22,668 CassandraDaemon.java (line 187) Exception in thread Thread[HintedHandoff:36,1,main]
org.apache.cassandra.db.filter.TombstoneOverwhelmingException
    at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:201)
    at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:122)
    at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:80)
    at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:72)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:297)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1516)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1335)
    at org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:351)
    at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:309)
    at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:92)
    at org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:530)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Forgot to mention: I'm running Cassandra version 2.0.4.

natli

2 Answers


When a query that returns a range of rows (or columns) is issued to Cassandra, it has to scan the table to collect the result set (this is called a slice). Now, deleted data is stored in the same manner as regular data, except that it's marked as tombstoned until compacted away. But the table reader has to scan through it nevertheless. So if you have tons of tombstones lying around, you will have an arbitrarily large amount of work to do to satisfy your ostensibly limited slice.

A concrete example: let's say you have two live rows with clustering keys 1 and 3, and a hundred thousand dead rows with clustering key 2 that sit between rows 1 and 3 in the table. Now when you issue a SELECT query for keys >= 1 and <= 3, Cassandra has to scan 100,002 rows instead of the expected two.
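In CQL terms, that layout would look roughly like the sketch below; the keyspace, table, and column names are made up for illustration:

    -- hypothetical table: within a partition, rows are ordered by the clustering key "id"
    CREATE TABLE my_keyspace.events (
        bucket  int,
        id      int,
        payload text,
        PRIMARY KEY (bucket, id)
    );

    -- a slice over a clustering-key range: Cassandra has to read every cell
    -- between the bounds, live or tombstoned, to build the result
    SELECT * FROM my_keyspace.events
    WHERE bucket = 1 AND id >= 1 AND id <= 3;

If almost everything between those bounds has been deleted, the read still has to walk over all of the tombstones before it can return the two live rows.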

To make it worse, Cassandra doesn't just scan through these rows; it also has to accumulate them in memory while it prepares the response. This can cause an out-of-memory error on the node if things go too far, and if multiple nodes are servicing the request, it can even cause cascading failures that bring down the whole cluster. To prevent this from happening, the service aborts the query when it detects a dangerous number of tombstones. You're free to crank this threshold up, but it's risky if your Cassandra heap comes close to running out during these spikes.

This exception was introduced in a recent fix, first available in 2.0.2. Here is the bug entry describing the problem the change was trying to address. Previously everything would have been just fine, until one of your nodes, or potentially several, suddenly crashed.

"If it's merely some selects being aborted, it's not that big a deal. But that's assuming abort means 'capped' in that the query stops prematurely and returns whatever live data it managed to gather before too many tombstones were found."

The query doesn't return a capped result set; it drops the request completely. If you'd like to mitigate this, it may be worth running your bulk row deletion on the same cadence as the grace period, so you don't build up this huge influx of tombstones every week.

Daniel S.
  • According to the error log in my question, the exception occurred during a hinted handoff. This seems to imply the issue doesn't only happen during `SELECT` queries, but also during `inter-node communication`. Is this correct? The reason it matters is that this table has a `compound key`, and a regular select will only ever query by the first of those keys, making the number of tombstones during said query insignificant. – natli Feb 14 '14 at 08:33
  • 1
    Yes, hinting is a protocol for exchanging information between nodes, but it's an optional feature designed to improve cluster performance during node outages. You can read more at http://www.datastax.com/dev/blog/modern-hinted-handoff . Hints are stored in a system table, so sending one involves doing the slice with all its potential problems with respect to tombstones. – Daniel S. Feb 14 '14 at 19:58
  • 1
    I don't have deep insights into the `HintedHandoffManager`, so I can't say whether getting an inordinate number of tombstones in hint tables is indicative of a bad use pattern. I can only mention that you're not the only one who sees these; see http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Crash-with-TombstoneOverwhelmingException-td7592018.html for a related discussion. If you can correlate these crashes with a particular operation or a periodic task, it may give you a lead as to why are so many hints getting generated in the first place. – Daniel S. Feb 14 '14 at 20:04
  • Thanks, it's strange though; I use a write consistency of `ONE` so if I understand correctly `hintedhandoff` shouldn't even come into play... maybe it's from writes on an old keyspace, but that's just an uneducated guess. – natli Feb 15 '14 at 08:57
  • 1
    @natli to your last comment: even with CL of ONE to writes, in case the node that owns the token is down - C* will use hinted-handoff to update it when it comes back up. – Nir Alfasi Mar 12 '15 at 19:01
  • @DanielS. Nice answer. I have a question though: in your example, if I query just key == 3, is it affected by tombstones too? And if so, what if the key is a secondary index? Is it the same? – Lion.k Aug 14 '18 at 09:34

Here is a link to the full solution:

Clean up tombstones by lowering gc_grace_seconds to a shorter period that suits your application, or use TTLs for certain data (see the TTL sketch after the example below). For example, the default gc_grace_seconds is 864000 (10 days). If your TTL data is set to 6 days, then you might want to change gc_grace_seconds to 604800 (7 days) to remove tombstones sooner.

https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

cqlsh:results> ALTER TABLE example WITH gc_grace_seconds = 10000;
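If you go the TTL route mentioned above instead, the write would look something like the following sketch; the column names are made up for illustration (518400 seconds = 6 days):

    -- hypothetical columns; the row expires automatically after 6 days
    INSERT INTO results.example (id, score) VALUES (42, 7) USING TTL 518400;

Expired cells still behave like tombstones until they are compacted away, but per-row TTLs spread the expirations out over time instead of producing one huge batch of deletes.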

Regards,

Ali

Ali Ait-Bachir
  • Wouldn't this just create tombstones sooner? They still wouldn't be removed until a compaction/repair event happened. – xref Sep 25 '19 at 18:24