We're experiencing issues with continuously running Java applications that update counters in Cassandra. From monitoring the servers we don't see any correlation between the failures and the load. The queries are quite constant, because they update values in only 8 different tables. Every minute the Java applications fire thousands of queries (it can be 20k or even 50k queries), but every once in a while some of those fail. When that happens we write them to a file, along with the exception message. This message is always:
Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
We did some googling and troubleshooting and took several actions:
- Changed the retry policy in the Java applications to DefaultRetryPolicy instead of the FallthroughRetryPolicy, to have the client retry a query on failure.
- Changed the write_request_timeout_in_ms setting on the Cassandra nodes from the standard value of 2000 to 4000 and then to 10000.
These actions reduced the number of failing queries, but failures still occur. Of the millions of queries that are executed on an hourly basis, we see about 2000 failing queries over a period of 24 hours. All have the same exception listed above, and they occur at varying times.
Of course we see from the logs that when queries do fail, it takes a while, because the client waits for a timeout and performs retries.
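To put the failure count in perspective, here is a rough back-of-the-envelope calculation. It assumes the 20k and 50k queries-per-minute figures are sustained totals across all applications, which is a simplification; the class and method names are made up for this illustration.

```java
// Rough estimate of the failure rate from the figures in the question.
// ASSUMPTION: 20k and 50k queries/minute are treated as sustained, cluster-wide totals.
class FailureRate {

    /** Fraction of failed queries, given failed and total counts over the same period. */
    static double failureFraction(long failed, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return (double) failed / total;
    }

    public static void main(String[] args) {
        long failedPerDay = 2_000L; // "about 2000 failing queries over a period of 24 hours"
        long[] assumedQueriesPerMinute = {20_000L, 50_000L};

        for (long perMinute : assumedQueriesPerMinute) {
            long perDay = perMinute * 60 * 24;
            double percent = 100.0 * failureFraction(failedPerDay, perDay);
            System.out.printf("%d queries/min -> %d queries/day, failure rate %.5f%%%n",
                    perMinute, perDay, percent);
        }
    }
}
```

Under these assumptions the failure rate comes out at roughly 0.007% (20k per minute) down to 0.003% (50k per minute) of all queries, so the failures are rare but persistent.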
Some facts:
- We run Cassandra v2.2.5 (recently upgraded from v2.2.4)
- We have a geo aware Cassandra cluster with 6 nodes: 3 in Europe, 3 in US.
- The java applications that fire queries are the only clients that communicate with Cassandra (for now).
- The number of java applications is 10: 5 in EU, 5 in US.
- We execute all queries asynchronously (session.executeAsync(statement);) and keep track of which individual queries succeed or fail by adding callbacks for success and failure.
- The replication factor is 2.
- We run Oracle Java 1.7.0_76
Java(TM) SE Runtime Environment (build 1.7.0_76-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
- The 6 Cassandra nodes run on bare metal with the following specs:
- Storage is a group of SSDs in raid 5.
- Each node has 2x (6 core) Intel Xeon E5-2620 CPUs @ 2.00GHz (for a total of 24 hardware threads).
- The RAM size is 128GB.
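The callback-based tracking mentioned in the list above follows roughly the pattern below. This is only a sketch of the idea, not our actual code: it uses the JDK's CompletableFuture (Java 8+) as a stand-in for the driver's ResultSetFuture so that it runs without the driver on the classpath, and the class name AsyncWriteTracker and its fields are hypothetical. In the real applications the callbacks are attached to the future returned by session.executeAsync(statement), and failures go to a file rather than an in-memory queue.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts outcomes of asynchronous writes and records failures.
// CompletableFuture stands in for the driver's ResultSetFuture (an assumption for this sketch).
class AsyncWriteTracker {
    final AtomicLong succeeded = new AtomicLong();
    final AtomicLong failed = new AtomicLong();
    // Failed statements with their exception message, to be written to a file later.
    final Queue<String> failureLog = new ConcurrentLinkedQueue<>();

    /** Attaches a success/failure callback to one in-flight query. */
    <T> CompletableFuture<T> track(String statement, CompletableFuture<T> future) {
        return future.whenComplete((result, error) -> {
            if (error == null) {
                succeeded.incrementAndGet();
            } else {
                failed.incrementAndGet();
                failureLog.add(statement + "\t" + error.getMessage());
            }
        });
    }

    public static void main(String[] args) {
        AsyncWriteTracker tracker = new AsyncWriteTracker();

        // Simulate one successful and one timed-out write.
        CompletableFuture<String> ok = new CompletableFuture<>();
        CompletableFuture<String> timedOut = new CompletableFuture<>();
        tracker.track("UPDATE ... (first)", ok);
        tracker.track("UPDATE ... (second)", timedOut);

        ok.complete("applied");
        timedOut.completeExceptionally(
                new RuntimeException("Cassandra timeout during write query at consistency ONE"));

        System.out.println("succeeded=" + tracker.succeeded.get()
                + " failed=" + tracker.failed.get());
        for (String entry : tracker.failureLog) {
            System.out.println(entry);
        }
    }
}
```

The counters and the log are thread-safe because the callbacks may run on whichever thread completes the future.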
How we create the cluster:
private Cluster createCluster() {
    return Cluster.builder()
            .addContactPoints(contactPoints)
            .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
            .withLoadBalancingPolicy(getLoadBalancingPolicy())
            .withReconnectionPolicy(new ConstantReconnectionPolicy(reconnectInterval))
            .build();
}

private LoadBalancingPolicy getLoadBalancingPolicy() {
    return DCAwareRoundRobinPolicy.builder()
            .withUsedHostsPerRemoteDc(allowedRemoteDcHosts) // == 3
            .build();
}
How we create the keyspace:
CREATE KEYSPACE IF NOT EXISTS traffic WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'AMS1': 2, 'WDC1': 2};
Example table (they all look similar):
CREATE TABLE IF NOT EXISTS traffic.per_node (
    node text,
    request_time timestamp,
    bytes counter,
    ssl_bytes counter,
    hits counter,
    ssl_hits counter,
    PRIMARY KEY (node, request_time)
) WITH CLUSTERING ORDER BY (request_time DESC)
  AND compaction = {'class': 'DateTieredCompactionStrategy'};