Sorry for the unstructured post. This is my first post here and I am not a developer. We would appreciate any help we can get. Thank you in advance.
We are handling customer support for a client who bought our product, which uses Cassandra as its database. The customer has a single Cassandra node and is using a SAN device. We know that this can be bad practice, and I am aware of the following article: https://www.datastax.com/dev/blog/impact-of-shared-storage-on-apache-cassandra
The customer’s storage (the Cassandra database) crashes every 2-10 hours with the following exception:
ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2017-12-10 19:54:16,279 JVMStabilityInspector.java:118 - JVM state determined to be unstable. Exiting forcefully due to:
org.apache.cassandra.io.FSWriteError: java.io.IOException: The semaphore timeout period has expired
at org.apache.cassandra.db.commitlog.MemoryMappedSegment.write(MemoryMappedSegment.java:100) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:296) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:230) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:93) ~[apache-cassandra-2.2.8.jar:2.2.8]
at java.lang.Thread.run(Unknown Source) [na:1.8.0_151]
Caused by: java.io.IOException: The semaphore timeout period has expired
at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_151]
at java.nio.MappedByteBuffer.force(Unknown Source) ~[na:1.8.0_151]
at org.apache.cassandra.utils.SyncUtil.force(SyncUtil.java:113) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.db.commitlog.MemoryMappedSegment.write(MemoryMappedSegment.java:96) ~[apache-cassandra-2.2.8.jar:2.2.8]
... 4 common frames omitted
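One way to check whether the storage can be made to stall outside of Cassandra might be a small probe like the one below. This is only a sketch under stated assumptions: it is written in Python rather than Java, the function name `time_mmap_flush` and the file name `mmap_flush_probe.bin` are made up for illustration, and Python's `mmap.flush()` is, as far as I understand, only broadly similar to the `MappedByteBuffer.force` call in the stack trace, not identical to it.

```python
import mmap
import os
import time


def time_mmap_flush(path, size_bytes=1024 * 1024, rounds=5):
    """Write into a memory-mapped file and time each flush, in seconds."""
    # Pre-size the file so that it can be mapped.
    with open(path, "wb") as f:
        f.truncate(size_bytes)

    durations = []
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size_bytes) as mapped:
            for i in range(rounds):
                # Dirty the whole mapping so the flush has real work to do.
                mapped[:] = bytes([i % 256]) * size_bytes
                start = time.perf_counter()
                # On Python 3.8+ a failed flush raises OSError.
                mapped.flush()
                durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical location: run this from the volume that holds the commit log.
    target = os.path.join(os.getcwd(), "mmap_flush_probe.bin")
    for n, seconds in enumerate(time_mmap_flush(target), start=1):
        print(f"flush {n}: {seconds * 1000:.1f} ms")
    os.remove(target)
```

If flushes on the SAN volume are occasionally very slow or fail while the same probe on a local disk stays fast, that would point at the storage path rather than at Cassandra itself.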
My questions are:
- Is it possible to make Cassandra work at the cost of performance? The customer bought the SAN device to use for our product. They are even willing to migrate our product from the existing RAID 5 LUN to a new RAID 10 LUN but I am not sure that it will work.
- Would it be worth trying to tweak some of the configuration parameters for Cassandra and see if the database stops crashing? If yes, then what configuration parameters would affect this issue?
After we monitored the performance data and reviewed the exceptions, we decided to try to make Cassandra more stable by slowing it down. We changed the parameters that affect concurrent reads and writes. Our thinking was that once the Cassandra database is stable enough, we could start increasing the values a bit. Specifically, we changed the following properties in the Cassandra.yaml file:
commitlog_sync_period_in_ms: 3600000
concurrent_reads: 4
concurrent_writes: 4
concurrent_counter_writes: 4
With these settings, Cassandra crashed again after 1.5 hours.
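To compare how often the node dies before and after a configuration change, the timestamps of the fatal log lines could be pulled out of the Cassandra log and the gaps between them computed. The following is a minimal sketch, assuming the log lines look like the `JVMStabilityInspector` line quoted above; the function names `crash_times` and `hours_between` are made up for illustration, and the log file path is passed on the command line.

```python
import re
import sys
from datetime import datetime

# Matches lines such as:
# ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2017-12-10 19:54:16,279 JVMStabilityInspector.java:118 - JVM state determined to be unstable...
FATAL_LINE = re.compile(
    r"^ERROR \[[^\]]+\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r".*JVM state determined to be unstable"
)


def crash_times(lines):
    """Return the timestamps of all 'JVM state ... unstable' log lines."""
    times = []
    for line in lines:
        match = FATAL_LINE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return times


def hours_between(times):
    """Return the gap in hours between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python crash_gaps.py <path to the Cassandra log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        found = crash_times(log)
    for when in found:
        print("crash at", when)
    print("hours between crashes:", [round(h, 2) for h in hours_between(found)])
```

This only measures the time between fatal errors, not uptime since the last restart, so it is a rough indicator at best.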
**Cassandra.yaml:**
batchlog_replay_throttle_in_kb: 1024
role_manager: CassandraRoleManager
roles_validity_in_ms: 2000
disk_failure_policy: die
disk_access_mode: standard
commit_failure_policy: die
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb: 0
counter_cache_save_period: 7200
commitlog_sync: periodic
commitlog_sync_period_in_ms: 3600000
commitlog_segment_size_in_mb: 128
concurrent_reads: 4
concurrent_writes: 4
concurrent_counter_writes: 4
file_cache_size_in_mb: 128
memtable_heap_space_in_mb: 128
memtable_offheap_space_in_mb: 128
memtable_allocation_type: heap_buffers
commitlog_total_space_in_mb: 1024
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7100
thrift_framed_transport_size_in_mb: 160
incremental_backups: false
column_index_size_in_kb: 64
batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50
unlogged_batch_across_partitions_warn_threshold: 10
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: true
    optional: false
    require_client_auth: true
Customer environment:
ReleaseVersion: 2.2.8
Windows 2012 R2
Java 1.8.0_151
Resource Monitor of disk where Cassandra storage is located
Perfmon data