JVM state determined to be unstable

Question

Sorry for the unstructured post. This is my first time posting here and I am not a developer. We would appreciate any help we can get. Thank you in advance.

We provide support for a customer who bought our product, which uses Cassandra as its database. The customer runs a single Cassandra node on a SAN device. We know this can be bad practice, and I am aware of the following article: https://www.datastax.com/dev/blog/impact-of-shared-storage-on-apache-cassandra
The customer’s Storage (Cassandra database) crashes every 2-10 hours with the following exceptions:

  ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2017-12-10 19:54:16,279 JVMStabilityInspector.java:118 - JVM state determined to be unstable.  Exiting forcefully due to:
                org.apache.cassandra.io.FSWriteError: java.io.IOException: The semaphore timeout period has expired 
                        at org.apache.cassandra.db.commitlog.MemoryMappedSegment.write(MemoryMappedSegment.java:100) ~[apache-cassandra-2.2.8.jar:2.2.8]
                        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:296) ~[apache-cassandra-2.2.8.jar:2.2.8]
                        at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:230) ~[apache-cassandra-2.2.8.jar:2.2.8] 
                        at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:93) ~[apache-cassandra-2.2.8.jar:2.2.8]
                        at java.lang.Thread.run(Unknown Source) [na:1.8.0_151] 
                Caused by: java.io.IOException: The semaphore timeout period has expired 
                        at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_151] 
                        at java.nio.MappedByteBuffer.force(Unknown Source) ~[na:1.8.0_151] 
                        at org.apache.cassandra.utils.SyncUtil.force(SyncUtil.java:113) ~[apache-cassandra-2.2.8.jar:2.2.8] 
                        at org.apache.cassandra.db.commitlog.MemoryMappedSegment.write(MemoryMappedSegment.java:96) ~[apache-cassandra-2.2.8.jar:2.2.8]
                        ... 4 common frames omitted
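
To line the crashes up with the Perfmon data, the crash timestamps could be pulled out of Cassandra's log file. The following is only a rough sketch in Python, not something from our product. It assumes every crash writes a line in the same format as the one above, and the names `crash_times` and `hours_between` are made up for illustration.

```python
import re
from datetime import datetime

# Matches log lines such as:
# ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2017-12-10 19:54:16,279 JVMStabilityInspector.java:118 - JVM state determined to be unstable.
CRASH_LINE = re.compile(
    r"^\s*ERROR \[[^\]]+\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r".*JVM state determined to be unstable"
)


def crash_times(lines):
    """Return the timestamp of every 'JVM state determined to be unstable' line."""
    times = []
    for line in lines:
        match = CRASH_LINE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def hours_between(times):
    """Return the hours elapsed between each pair of consecutive crashes."""
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]
```

Both functions accept any iterable of lines, so an open log file can be passed directly, for example `hours_between(crash_times(open(path)))`, where `path` is wherever the customer's Cassandra log is stored.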

My questions are:

1. Is it possible to make Cassandra work at the cost of performance? The customer bought the SAN device to use for our product. They are even willing to migrate our product from the existing RAID 5 LUN to a new RAID 10 LUN, but I am not sure that it will work.
2. Would it be worth trying to tweak some of the configuration parameters for Cassandra and see if the database stops crashing? If yes, then what configuration parameters would affect this issue?

After we monitored the performance data and reviewed the exceptions, we decided to try to make Cassandra more stable by slowing it down. We changed the parameters that affect concurrent reads and writes. Our thinking was that once the Cassandra database became stable enough, we could start increasing the values a bit. Specifically, we changed the following properties in the cassandra.yaml file:

    commitlog_sync_period_in_ms: 3600000
    concurrent_reads: 4
    concurrent_writes: 4
    concurrent_counter_writes: 4

Cassandra crashed again after 1.5 hours.

**cassandra.yaml:**

    batchlog_replay_throttle_in_kb: 1024
    role_manager: CassandraRoleManager
    roles_validity_in_ms: 2000
    disk_failure_policy: die
    disk_access_mode: standard 
    commit_failure_policy: die
    key_cache_save_period: 14400
    row_cache_size_in_mb: 0
    row_cache_save_period: 0
    counter_cache_size_in_mb: 0
    counter_cache_save_period: 7200
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 3600000
    commitlog_segment_size_in_mb: 128
    concurrent_reads: 4
    concurrent_writes: 4
    concurrent_counter_writes: 4
    file_cache_size_in_mb: 128
    memtable_heap_space_in_mb: 128
    memtable_offheap_space_in_mb: 128
    memtable_allocation_type: heap_buffers
    commitlog_total_space_in_mb: 1024
    index_summary_resize_interval_in_minutes: 60
    trickle_fsync: false
    trickle_fsync_interval_in_kb: 10240
    storage_port: 7100
    thrift_framed_transport_size_in_mb: 160
    incremental_backups: false
    column_index_size_in_kb: 64
    batch_size_warn_threshold_in_kb: 5
    batch_size_fail_threshold_in_kb: 50
    unlogged_batch_across_partitions_warn_threshold: 10    
    server_encryption_options:
        internode_encryption: none
        keystore: conf/.keystore
        keystore_password: cassandra
        truststore: conf/.truststore
        truststore_password: cassandra
    client_encryption_options:
        enabled: true
        optional: false
        require_client_auth: true


 Customer environment:
    ReleaseVersion: 2.2.8
    Windows 2012 R2
    Java 1.8.0_151

Resource Monitor of disk where Cassandra storage is located
Perfmon data
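
The stack trace shows the failure inside `MappedByteBuffer.force`, that is, while a memory-mapped commit log segment is being flushed to disk. One way to check whether the SAN volume itself stalls, independent of Cassandra, might be to time the same kind of flush from a small script. This is a rough sketch in Python and not Cassandra's own code. `measure_flush_latency` is a made-up helper, and the directory passed to it is an assumption: it would need to be a folder on the same LUN that holds the commit log.

```python
import mmap
import os
import tempfile
import time


def measure_flush_latency(directory, rounds=20, size=1024 * 1024):
    """Time `rounds` flushes of a memory-mapped scratch file in `directory`.

    Returns a list of flush durations in seconds.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size)  # give the scratch file a fixed size to map
        with mmap.mmap(fd, size) as mapped:
            for i in range(rounds):
                mapped[0:8] = i.to_bytes(8, "little")  # dirty the first page
                start = time.perf_counter()
                mapped.flush()  # write dirty pages back to the file
                latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)  # the scratch file is always cleaned up
    return latencies
```

Calling `max(measure_flush_latency(folder))` gives the slowest flush in seconds. If a plain flush like this occasionally takes many seconds or fails on the SAN volume while Cassandra is stopped, that would point at the storage path rather than at the Cassandra settings.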

What is the size of the JVM Heap, and how much RAM is available? — Aaron, Jan 16 '18 at 15:24
And although Windows may be *supported* for Cassandra as of 2.2, the best chance for running a successful, stable cluster is still to deploy it on Linux. — Aaron, Jan 16 '18 at 15:26
Our product does not run on UNIX. JVM heap is 10 GB. Machine RAM is 64 GB — MGold, Jan 16 '18 at 20:11
How is the garbage collection configured? Is it using G1GC in cassandra-env.sh? — dilsingi, Jan 16 '18 at 21:19
So that’s the problem right there. Once you switch to G1GC, your garbage collector, and hence your JVM, will be much more stable. — dilsingi, Jan 17 '18 at 05:14
Thank you for the suggestion. The customer just migrated the product to the new RAID 10 LUN, so we will try it. — MGold, Jan 18 '18 at 16:17
