6

My time series data expires via TTL after 1-7 days (depending on the use case). The data is immutable and ordered by timestamp (the clustering column is the timestamp). Data is timestamped on write, so new timestamps should always be increasing.

The partition size should not exceed 10K items, and is usually much smaller (at most ~10MB for a full 10K items).

I didn't find any good documentation on how the compaction strategy should be configured (what parameters to take into account) so I just decided to do it like this:

compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '7', 'compaction_window_unit': 'DAYS'}

I am not at all sure that this is correct.

Which KPIs should I be taking into account?

Avba

2 Answers

13

There is no single right answer:

As a result of your configuration, data will be compacted together if it was inserted in the last 7 days. The biggest advantage of TWCS is that it can expire entire SSTables without even reading them because it knows that all the data inside the SSTable is already expired.

In this case, the data that you wrote with a 1-day TTL cannot be dropped yet, because it is lumped together with everything else in a 7-day window. In the worst case, your SSTable will contain a mutation that was inserted right at the end of the 7-day window, so the entire SSTable will be kept around for 7 more days until that one mutation expires.

This sounds suboptimal, but at least you will be able to serve all your reads for data in that window from a single SSTable. Going the other way around, you would set, for instance, the window to one day. This would make your data expire a lot faster but for the data that is alive for 7 days you would now be touching 7 SSTables instead of one.

Summary:

Larger time windows: slower expiration, faster reads for live data.
Smaller time windows: faster expiration, slower reads for live data.

As with most things in life, the truth is in the middle! Both options would work, and you now understand the trade-offs, but the best window is probably somewhere between 1 and 7 days.
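To put rough numbers on this trade-off, here is a minimal sketch in Python. It rests on a simplified model that is an assumption, not something taken from the TWCS documentation: each window ends up as a single SSTable, an SSTable can only be dropped once its newest cell has expired, and window boundaries are ignored when counting SSTables. The function name `twcs_estimate` is hypothetical.

```python
import math

def twcs_estimate(ttl_hours, window_hours):
    """Rough TWCS model (assumption: one SSTable per compacted window)."""
    # SSTables holding live data: the TTL span divided into windows.
    # Boundary alignment is ignored, so the real count can be one higher.
    live_sstables = math.ceil(ttl_hours / window_hours)
    # The oldest cell in a window has to wait for the newest cell in the
    # same window to expire, so it can linger one extra window length.
    worst_case_retention_hours = ttl_hours + window_hours
    return live_sstables, worst_case_retention_hours

# Compare a few window sizes for data with a 7-day TTL.
for window in (1, 24, 72, 168):
    sstables, retention = twcs_estimate(ttl_hours=7 * 24, window_hours=window)
    print(f"window={window}h live_sstables={sstables} worst_case_retention={retention}h")
```

Under this model, a 7-day window holding 1-day TTL data keeps the oldest cells for up to 8 days, while a 1-hour window with a 7-day TTL spreads live data over 168 SSTables. Real numbers will differ with flush frequency and compaction lag, so treat this as a way to reason about the window size rather than a prediction.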

Glauber Costa
  • Is there an issue with having very small time windows if the data is aggregated to minute buckets? That is to say that every minute a new primary key is created. In this case is it reasonable to have a time window of let's say 1 hour? That would entail having hundreds of SSTables for a 7 day period. – Avba Oct 10 '18 at 06:30
  • The issue is that you will be left with one SSTable per bucket. If you have too many buckets, your reads may become too expensive as you will have to touch many SSTables to serve a read. – Glauber Costa Oct 11 '18 at 23:31
0

Expired TTL data turns into tombstones in the SSTables, which are removed by compaction. Too many tombstones will badly affect your read performance.

So in your case it is worth monitoring the number of tombstones per read, with nodetool tablestats or JMX.
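As a rough illustration of that monitoring, the sketch below pulls tombstone figures out of text captured from `nodetool tablestats <keyspace>.<table>`. It assumes the output contains "label: number" lines that mention tombstones, such as "Average tombstones per slice (last five minutes)"; the exact wording can vary between Cassandra versions, the sample text is made up for illustration, and `tombstone_metrics` is a hypothetical helper name.

```python
import re

def tombstone_metrics(tablestats_output):
    """Collect every 'label: number' line whose label mentions tombstones."""
    metrics = {}
    pattern = re.compile(r"\s*(.*tombstones.*?):\s*([0-9.]+)\s*$", re.IGNORECASE)
    for line in tablestats_output.splitlines():
        match = pattern.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics

# Illustrative sample only; real output has many more lines per table.
SAMPLE = """\
    Table: events
    Average tombstones per slice (last five minutes): 1.5
    Maximum tombstones per slice (last five minutes): 12
"""

for label, value in tombstone_metrics(SAMPLE).items():
    print(f"{label} = {value}")
```

Lines whose value is not numeric (for example a table that has not been read recently) are simply skipped by this parser.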

See this nice article about deleting tombstones in cassandra.

barth