We have a 24-node AWS cluster (i2.xlarge) running Cassandra 2.2.5. We have one large table and a few smaller ones; the large table consumes most of the disk space, and that usage is unexpectedly increasing.

We are using LCS and have noticed that the SSTables are not getting moved into higher levels.

cfstats on the table shows that the SSTables are not being compacted into higher levels:

    SSTables in each level: [2, 20/10, 206/100, 2146/1000, 1291, 0, 0, 0, 0]

The dataset finished loading about a month ago, at which point disk usage was 60-65%. We are now updating the dataset, and disk usage is climbing by about 0.5% per day; the nodes are currently 75-80% full. Rows are being updated, but no rows are being added or deleted, so we did not expect disk usage to grow. Our best guess is that compactions are no longer removing overwritten duplicates from the SSTables.
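For reference, these are the standard nodetool commands we use to watch the compaction backlog and throttle (the keyspace/table name matches the cfstats output further down):

    # Pending and in-flight compactions, and the current throughput throttle
    nodetool compactionstats
    nodetool getcompactionthroughput

    # Per-level SSTable counts for the big table (same figures as the cfstats excerpts)
    nodetool cfstats overlordprod.document | grep "SSTables in each level"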

When trying to force a compaction on the dataset (nodetool compact), we get an error about insufficient disk space.

    "error: Not enough space for compaction, estimated sstables = 1977, expected write size = 331746061359"

Documentation on LCS claims that "Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction." In our case the compaction appears to require 1977 × 160 MB.
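The numbers do line up with that reading: assuming the default 160 MB `sstable_size_in_mb` (we have not overridden it), 1977 SSTables is almost exactly the expected write size from the error. A rough check:

    # 1977 sstables x 160 MiB (default LCS sstable_size_in_mb) vs. the error's expected write size
    echo $((1977 * 160 * 1024 * 1024))    # 331685560320 bytes, vs. 331746061359 in the error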

We did come across a suggestion to reset the LCS compaction levels: "Leveled Compaction Strategy with low disk space"

However, when we tried this on a smaller cluster with a smaller dataset exhibiting the same issue, the compactions that followed also appeared to need a huge amount of space, not just the 1.6 GB (10 × 160 MB) promised.
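The reset itself was done roughly like this on each node (the tool only works while Cassandra is stopped; the service command and tool path depend on the install, and the keyspace/table names are ours):

    # Stop the node, reset every SSTable of the table back to level 0, restart
    sudo service cassandra stop
    sstablelevelreset --really-reset overlordnightly document
    sudo service cassandra start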

Before:

    SSTables in each level: [1, 20/10, 202/100, 7, 0, 0, 0, 0, 0]
    Space used (live): 38202690995

After executing sstablelevelreset:

    SSTables in each level: [231/4, 0, 0, 0, 0, 0, 0, 0, 0]
    Space used (live): 38258539433

The first compaction after the reset started compacting 21698490019 bytes, which is roughly 129 SSTables' worth of data (21698490019 bytes / 160 MiB per SSTable ≈ 129).

On the small cluster we have enough extra disk space, but on the big one, there does not appear to be enough room to either force a compaction or to get compactions to start over by using the sstablelevelreset utility.

After the compactions finished, this is what the sstable levels look like (note that documents are continually being updated, but not added to the database):

    SSTables in each level: [0, 22/10, 202/100, 13, 0, 0, 0, 0, 0]
    Space used (live): 39512481279

Is there anything else we can do to try and recover disk space? Or at least to keep the disk usage from climbing?

The table is defined as follows:

    CREATE TABLE overlordnightly.document (
        id bigint PRIMARY KEY,
        del boolean,
        doc text,
        ver bigint
    ) WITH bloom_filter_fp_chance = 0.1
        AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
        AND comment = ''
        AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
        AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.DeflateCompressor'}
        AND dclocal_read_repair_chance = 0.1
        AND default_time_to_live = 0
        AND gc_grace_seconds = 864000
        AND max_index_interval = 2048
        AND memtable_flush_period_in_ms = 0
        AND min_index_interval = 128
        AND read_repair_chance = 0.0
        AND speculative_retry = '99.0PERCENTILE';
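For completeness, every write is a straight overwrite of an existing row by primary key; the old copy of the row only goes away once compaction merges the SSTables containing both versions. The real statements are prepared in the Java application; the statement below is a made-up illustration run through cqlsh:

    # Hypothetical example of the update pattern (values invented for illustration)
    cqlsh -e "UPDATE overlordnightly.document SET doc = '...updated document body...', ver = 43, del = false WHERE id = 12345;"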

Full cfstats from one of the nodes:

Keyspace: overlordprod
    Read Count: 68000539
    Read Latency: 3.948187530190018 ms.
    Write Count: 38569748
    Write Latency: 0.02441453179834102 ms.
    Pending Flushes: 0
        Table: document
        SSTable count: 3283
        SSTables in each level: [0, 22/10, 210/100, 2106/1000, 943, 0, 0, 0, 0]
        Space used (live): 526180595946
        Space used (total): 526180595946
        Space used by snapshots (total): 0
        Off heap memory used (total): 2694759044
        SSTable Compression Ratio: 0.22186642596102463
        Number of keys (estimate): 118246721
        Memtable cell count: 45944
        Memtable data size: 512614744
        Memtable off heap memory used: 0
        Memtable switch count: 1994
        Local read count: 68000545
        Local read latency: 4.332 ms
        Local write count: 38569754
        Local write latency: 0.027 ms
        Pending flushes: 0
        Bloom filter false positives: 526
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 2383928304
        Bloom filter off heap memory used: 2383902040
        Index summary off heap memory used: 24448020
        Compression metadata off heap memory used: 286408984
        Compacted partition minimum bytes: 87
        Compacted partition maximum bytes: 12108970
        Compacted partition mean bytes: 16466
        Average live cells per slice (last five minutes): 1.0
        Maximum live cells per slice (last five minutes): 1
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1
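One sanity check on those numbers: the live size divided by the SSTable count comes out to almost exactly the 160 MB LCS target, so the individual SSTables themselves look normal in size:

    # Average SSTable size on this node (rough check)
    echo $((526180595946 / 3283))    # ~160274321 bytes, i.e. ~160 MB per sstable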

Here's something that appears wrong about the compactions that are occurring. One in particular:

DEBUG [CompactionExecutor:1146] 2016-07-26 08:49:02,333 CompactionTask.java:142 - Compacting (cd2baa50-530d-11e6-9c8e-b5e6d88d6e11) [
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12943-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12970-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12972-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12953-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12955-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12957-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12978-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12976-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-4580-big-Data.db:level=4,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-14528-big-Data.db:level=2,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12949-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12959-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12974-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12962-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-11516-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12941-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12968-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12951-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12983-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12947-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12966-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12945-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12964-big-Data.db:level=3,
 ]

Notice that there are 23 SSTables being compacted: one from level 2, one from level 4, and the rest from level 3. In this case the compaction also needed more than 10× the SSTable size (3,720,676,532 bytes in, 3,531,157,508 bytes out). It ends up compacting these to level 3, but I was under the impression that SSTables only moved up in level. Why is a level 4 SSTable being compacted down to level 3? Now that I've noticed this in the logs, I see that it's a frequent occurrence. For example, here's another from around the same time:

DEBUG [CompactionExecutor:1140] 2016-07-26 08:46:47,420 CompactionTask.java:142 - Compacting (7cbb0390-530d-11e6-9c8e-b5e6d88d6e11) [
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12910-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-14524-big-Data.db:level=2,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12908-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-12906-big-Data.db:level=3,
 /data/cassandra/overlordprod/document-57ed497007c111e6a2174fb91d61e383/la-3543-big-Data.db:level=4,
 ]

I don't know if this is a problem or not.
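For what it's worth, the mix of source levels going into compactions is easy to pull out of the debug log (the log path may differ depending on the install):

    # Distribution of source levels across all compaction inputs logged so far
    grep -o "level=[0-9]*" /var/log/cassandra/debug.log | sort | uniq -c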

  • can you include your schema and how/what you're inserting? Possibly the data is something that cannot be consolidated. Do you rely on tombstones? Possibly your compactions can't keep up (206/100, 2146/1000) – Chris Lohfink Jul 12 '16 at 18:34
  • Schema is now included above. Rows are being inserted/updated from a Java application with the DataStax driver, one row at a time. There are no tombstones; we actually set a delete flag rather than deleting rows (we need to preserve the content). –  Jul 12 '16 at 19:11
  • can you include full cfstats? if you have a unique id for them it's going to create a lot of small partitions and it wouldn't be able to really merge them. Why are you using DeflateCompressor? LZ4Compressor could reduce CPU load on compactions and help it catch up. Do you delete anything? Have you tried increasing your compaction throughput (`nodetool setcompactionthroughput`)? – Chris Lohfink Jul 12 '16 at 19:19
  • cfstats for the table from one of the nodes is included above; the rest of them are similar. Each row has a unique key, to match the queries. But how would that make the SSTables unmergeable? We use Deflate because we saw that we actually get a better compression ratio than with LZ4. We haven't deleted anything from the table, so there should be no tombstones. Right now compaction throughput is set to 64, but there are no more than a handful of pending compactions, and there are often no compactions pending when there is a lull in updates, so it does not appear to even try to catch up. –  Jul 12 '16 at 20:26
  • [I think](https://stackoverflow.com/questions/29038041/low-ttl-with-leveled-compaction-should-i-reduce-gc-grace-seconds-to-improve-rea) `gc_grace_seconds` will help you – deFreitas Aug 10 '17 at 18:06

0 Answers