
I have a 5-node Cassandra cluster with ~650 GB of data on each node and a replication factor of 3. I have recently started seeing the following error in /var/log/cassandra/system.log:

INFO [ReadStage-5] 2017-10-17 17:06:07,887 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB

I have attempted to increase file_cache_size_in_mb in cassandra.yaml, but the same error returns soon after. I have gone as high as 2 GB for this parameter, to no avail.

When the error happens, CPU utilisation soars and read latencies become terribly erratic. The surge shows up approximately every half hour; note the timestamps in the log lines below.

INFO [ReadStage-5] 2017-10-17 17:06:07,887 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-36] 2017-10-17 17:36:09,807 NoSpamLogger.java:91 - Maximum memory usage reached (1.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-15] 2017-10-17 18:05:56,003 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB
INFO [ReadStage-28] 2017-10-17 18:36:01,177 NoSpamLogger.java:91 - Maximum memory usage reached (2.000GiB), cannot allocate chunk of 1.000MiB

Two of my tables are partitioned by hour, and the partitions are large. Here are their outputs from nodetool tablestats:

    Read Count: 4693453
    Read Latency: 0.36752741680805157 ms.
    Write Count: 561026
    Write Latency: 0.03742310516803143 ms.
    Pending Flushes: 0
        Table: raw_data
        SSTable count: 55
        Space used (live): 594395754275
        Space used (total): 594395754275
        Space used by snapshots (total): 0
        Off heap memory used (total): 360753372
        SSTable Compression Ratio: 0.20022598072758296
        Number of keys (estimate): 45163
        Memtable cell count: 90441
        Memtable data size: 685647925
        Memtable off heap memory used: 0
        Memtable switch count: 1
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 126710
        Local write latency: 0.096 ms
        Pending flushes: 0
        Percent repaired: 52.99
        Bloom filter false positives: 167775
        Bloom filter false ratio: 0.16152
        Bloom filter space used: 264448
        Bloom filter off heap memory used: 264008
        Index summary off heap memory used: 31060
        Compression metadata off heap memory used: 360458304
        Compacted partition minimum bytes: 51
        **Compacted partition maximum bytes: 3449259151**
        Compacted partition mean bytes: 16642499
        Average live cells per slice (last five minutes): 1.0005435888450147
        Maximum live cells per slice (last five minutes): 42
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 1
        Dropped Mutations: 0



    Read Count: 4712814
    Read Latency: 0.3356051004771247 ms.
    Write Count: 643718
    Write Latency: 0.04168356951335834 ms.
    Pending Flushes: 0
        Table: customer_profile_history
        SSTable count: 20
        Space used (live): 9423364484
        Space used (total): 9423364484
        Space used by snapshots (total): 0
        Off heap memory used (total): 6560008
        SSTable Compression Ratio: 0.1744084338623116
        Number of keys (estimate): 69
        Memtable cell count: 35242
        Memtable data size: 789595302
        Memtable off heap memory used: 0
        Memtable switch count: 1
        Local read count: 2307
        Local read latency: NaN ms
        Local write count: 51772
        Local write latency: 0.076 ms
        Pending flushes: 0
        Percent repaired: 0.0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 384
        Bloom filter off heap memory used: 224
        Index summary off heap memory used: 400
        Compression metadata off heap memory used: 6559384
        Compacted partition minimum bytes: 20502
        **Compacted partition maximum bytes: 4139110981**
        Compacted partition mean bytes: 708736810
        Average live cells per slice (last five minutes): NaN
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): NaN
        Maximum tombstones per slice (last five minutes): 0
        Dropped Mutations: 0

Here are the nodetool tablehistograms outputs for both tables:

cdsdb/raw_data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0.00             61.21              0.00           1955666               642
75%             1.00             73.46              0.00          17436917              4768
95%             3.00            105.78              0.00         107964792             24601
98%             8.00            219.34              0.00         186563160             42510
99%            12.00            315.85              0.00         268650950             61214
Min             0.00              6.87              0.00                51                 0
Max            14.00           1358.10              0.00        3449259151           7007506

cdsdb/customer_profile_history histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0.00             73.46              0.00         223875792             61214
75%             0.00             88.15              0.00         668489532            182785
95%             0.00            152.32              0.00        1996099046            654949
98%             0.00            785.94              0.00        3449259151           1358102
99%             0.00            943.13              0.00        3449259151           1358102
Min             0.00             24.60              0.00              5723                 4
Max             0.00           5839.59              0.00        5960319812           1955666

Could you please suggest a way forward to mitigate this issue?

Varsha

1 Answer


Based on the cfhistograms output posted, the partitions are enormous.

The 95th percentile partition size for the raw_data table is 107 MB, with a maximum of 3.44 GB. For customer_profile_history, the 95th percentile is 1.99 GB, with a maximum of 5.96 GB.

This correlates directly with the problem you see every half hour, as these huge partitions are written out to SSTables. The data model has to change: given the partition sizes above, it is better to use a partition interval of "minute" instead of "hour", so a 2 GB hourly partition shrinks to roughly a 33 MB per-minute partition.
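
For illustration, here is a minimal CQL sketch of that kind of change. The keyspace is taken from your histograms output, but the table and column names are hypothetical, since the actual schema wasn't posted:

    -- Hypothetical current layout: one partition per hour, which grows to multiple GB.
    CREATE TABLE cdsdb.raw_data_by_hour (
        hour_bucket timestamp,   -- e.g. '2017-10-17 17:00:00'
        event_time  timestamp,
        payload     blob,
        PRIMARY KEY ((hour_bucket), event_time)
    );

    -- Re-bucketed by minute: the same data spreads across 60x more partitions,
    -- so a ~2 GB hourly partition becomes roughly 33 MB per minute bucket.
    CREATE TABLE cdsdb.raw_data_by_minute (
        minute_bucket timestamp, -- e.g. '2017-10-17 17:06:00'
        event_time    timestamp,
        payload       blob,
        PRIMARY KEY ((minute_bucket), event_time)
    );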

The recommendation is to keep partitions to roughly 100 MB at most. Although Cassandra can theoretically store more than 100 MB in a partition, performance will suffer: every read of that partition pulls over 100 MB of data across the wire. In your case it is over 2 GB, with all the performance implications that come with it.
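
As a rough usage sketch (again with the hypothetical names above), reads then target only the minute buckets they need instead of one multi-gigabyte hourly partition:

    -- One small partition instead of an hour-sized one:
    SELECT * FROM cdsdb.raw_data_by_minute
    WHERE minute_bucket = '2017-10-17 17:06:00';

    -- Fetching a longer window means hitting several small minute buckets,
    -- e.g. with an IN clause over a few buckets or separate async queries:
    SELECT * FROM cdsdb.raw_data_by_minute
    WHERE minute_bucket IN ('2017-10-17 17:06:00', '2017-10-17 17:07:00');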

dilsingi
  • Wonderful!! Thanks for pointing this out. On that same note, what is the easiest way to re-partition a table this large? Do I need to create a new table and re-ingest the data into it, or is there a better way? – Varsha Oct 24 '17 at 04:00
  • Oddly, the table with the bigger partitions, "customer_profile_history", has only about 10 GB of data, while the "raw_data" table, whose partitioning could still be improved, is about 594 GB. So I would first concentrate on recreating the former table and loading the data using the bulk loader (sstableloader). You could use the same process to reload the second table's data as well, or consider using Spark loads if you are familiar with them. – dilsingi Oct 24 '17 at 04:39
  • @Varsha If there isn't any follow up questions, remember to accept the answer (the tick mark). – dilsingi Oct 25 '17 at 13:42
  • Thanks for your inputs this far. I was actually giving the solution a spin. I changed my partitioning to minute-wise instead of hour-wise. My application ordinarily does not access any data from the past, so I expected these errors to disappear once I deployed the new change, but they persist. It has been over 12 hours since I deployed my changes. Do you have an inkling why this might be? – Varsha Oct 26 '17 at 03:26
  • Can you clarify whether you have truncated the past data? It still goes through compaction cycles, so even though your application isn't accessing it, Cassandra is. I hope you have the cassandra.yaml auto_snapshot property set to its default of true, which means that if you drop the table it will keep a snapshot. That way neither Cassandra nor the application accesses the data anymore. This can be a temporary drop until you test out the app. – dilsingi Oct 26 '17 at 03:31
  • I am going to truncate one of my two tables later today, because I can truly live without that data. The other table though has a lot of data, close to 650GB on each node. How advisable is it to snapshot and then restore from snapshot? – Varsha Oct 26 '17 at 03:58
  • When you issue a truncate, it literally creates a snapshots directory within the existing table's data directory and moves the contents into it, so it completes within seconds irrespective of the size of the table. Try it out in a lower environment first. – dilsingi Oct 26 '17 at 04:02
  • I have not truncated the older data yet, because it needs application level changes. However I have run experiments to be convinced that the RCA for the memory error was indeed the large partitions. So thank you very much for all your inputs on this issue. – Varsha Oct 31 '17 at 03:25
  • Where can we update the partition interval for a table, and what steps need to be taken after updating the configuration? – vishal paalakurthi Jan 07 '20 at 12:18
  • @dilsingi Could you please add a suggestion for the question above? – vishal paalakurthi Mar 17 '20 at 09:39
  • @vishalpaalakurthi The partition interval, or rather the partition key, depends on your own table and use case. If you could elaborate on your use case and give details about the table, it would be easier to explain. – dilsingi Mar 17 '20 at 16:11