5

Currently, I have a Cassandra column family containing a large number of rows, say more than 100,000. Now I'd like to remove all the data in this column family, and a problem came up:

After all the data is removed, a lookup query against this column family takes tens of seconds to return an empty result, and the latency grows linearly with the amount of data that was originally stored.

This is caused by the tombstones written when data is deleted from Cassandra. The lookup speed won't return to normal until the tombstones are garbage-collected by the next compaction. See Cassandra Distributed Deletes.
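For reference, the access pattern looks roughly like this (the keyspace, table, and column names below are made up purely for illustration):

```sql
-- Hypothetical names, shown only to illustrate the pattern.
-- The delete writes tombstones rather than physically removing data:
DELETE FROM my_keyspace.my_cf WHERE user_id = 'some_user';

-- A later lookup must scan past all those tombstones before it can
-- return an empty result, which is where the latency comes from:
SELECT * FROM my_keyspace.my_cf WHERE user_id = 'some_user';
```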

Because such query operations are used frequently in my system, I cannot tolerate latencies of up to several seconds.

Would you please give me a solution to this problem?

Fify
  • Maybe using the [time series model](https://academy.datastax.com/resources/getting-started-time-series-data-modeling) could be a good approach? – deFreitas Aug 10 '17 at 13:03

2 Answers

3

This sounds like a very bad way to use a database: populate it, empty it, repeat. One way you can solve your problem is by using a different CF name each time: when you empty the data and start repopulating it, create a new column family, use that, and just drop the old column family. However, this is hacky.

I'd suggest using compaction (which gets rid of all the tombstones it can detect) to solve your problem. It is CPU-intensive, but it's better than waiting tens of seconds for queries to respond. You can make the task less intensive on your machine by providing the specific keyspace and column family you want to compact:

./nodetool compact <ks_name> <cf_name>

Richard's point is a good one: gc_grace_seconds is set to 10 days by default, so you will probably have to tweak this to allow compaction to get rid of the tombstones.
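Putting the two together, the workflow might look like the sketch below. The keyspace/table names are placeholders, and the exact gc_grace_seconds values are just examples:

```shell
# Sketch only -- 'my_keyspace' and 'my_cf' are placeholder names.

# 1. Temporarily lower gc_grace_seconds (the default is 864000 = 10 days):
cqlsh -e "ALTER TABLE my_keyspace.my_cf WITH gc_grace_seconds = 3600;"

# 2. Once the grace period has elapsed for the tombstones,
#    force a compaction on just that keyspace/column family:
./nodetool compact my_keyspace my_cf

# 3. Restore the default so normal repairs stay safe:
cqlsh -e "ALTER TABLE my_keyspace.my_cf WITH gc_grace_seconds = 864000;"
```

Be careful with step 1: if gc_grace_seconds is shorter than your repair interval, a node that missed the delete can resurrect the deleted data.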

Lyuben Todorov
    Note that compaction will only remove the tombstones after gc_grace_seconds has elapsed since the tombstone was inserted. – Richard Sep 26 '13 at 10:27
  • @Lyuben, I cannot empty the whole column family, because there are more than 1000 users whose data are stored in it, and each of them has more than 100,000 rows of data. Each deletion operation is executed on a single user's data. The **compact** operation on the column family may be a choice, but **when should this operation be triggered?** If it is triggered each time one user deletes some data, it may affect all other users. What's your suggestion on this? Thanks again! And thanks to Richard for the reminder about _gc_grace_seconds_. – Fify Sep 29 '13 at 02:34
0

@Fify

If your column family is frequently modified (read, then update, then read the update again...), you should use the leveled compaction strategy.

To make deleted columns be removed more quickly, lower the gc_grace_seconds property of your column family.
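Both changes can be applied with ALTER TABLE; this is a sketch in which the table name and the gc_grace_seconds value are placeholders you would adapt to your schema:

```sql
-- Sketch only: 'my_keyspace.my_cf' is a placeholder name.
ALTER TABLE my_keyspace.my_cf
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

ALTER TABLE my_keyspace.my_cf
  WITH gc_grace_seconds = 86400;  -- e.g. one day instead of the 10-day default
```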

doanduyhai
  • Thanks for your reply. 1) The most common operations on my column family are _insertion_ and then _read_; _deletion_ happens sometimes but with very low probability (let's say 1 out of 100 operations). 2) The **gc_grace_seconds** cannot be too short because there are several TBs of data stored in the database. – Fify Sep 29 '13 at 01:35