
I have a service that connects to our Cassandra cluster and executes tens of thousands of queries per day using Lightweight Transactions (LWTs) to implement the consensus system described here. For the most part it works fine, but sporadically the writes fail with the error "Operation timed out - received only 1 responses" (or, less commonly, only 0 responses). We're using the DataStax Python driver. When the error occurs, the full error line (at the end of the stack trace) reads:

WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_SERIAL'}

Is this something that seems expected to occur from time to time in a production Cassandra setup? Or does it seem like something where we could have a configuration problem with our Cassandra cluster or network?

Some information about our Cassandra cluster: it is an 8-node setup spread across 2 Amazon EC2 regions (4 nodes per region). All of the nodes are running version 3.3.0 of the DataStax Cassandra distribution.
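The `info` dict in the error can be read with a little arithmetic. LOCAL_SERIAL runs the Paxos round among a quorum of replicas in the local datacenter, and a local quorum for a per-datacenter replication factor RF is RF // 2 + 1. The question does not state the replication factor, so this small illustration only lists which RF values (capped at the 4 nodes per region) are consistent with `required_responses: 2`; `received_responses: 1` then means the coordinator was one replica short.

```python
def local_quorum(rf: int) -> int:
    """Replica acks needed by LOCAL_QUORUM / LOCAL_SERIAL within one datacenter."""
    return rf // 2 + 1


# A per-DC replication factor cannot exceed the 4 nodes in each region.
NODES_PER_DC = 4

# Which replication factors would make the coordinator wait for exactly 2 replicas?
consistent_rfs = [rf for rf in range(1, NODES_PER_DC + 1) if local_quorum(rf) == 2]
print(consistent_rfs)  # [2, 3]
```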

Devin
  • Same problem here, just to add: all SO answers related to this problem are about increasing timeouts. In my case, the WriteTimeout is received 10-100 ms after the request starts, while all my timeouts in cassandra.yaml are above 10 s. I think this could also be a problem in the DataStax Python Cassandra client. – Michal Mar 20 '17 at 13:49
  • FYI, none of those are helpful: [answer 1](http://stackoverflow.com/questions/30575125/coordinator-node-timed-out-waiting-for-replica-nodes-in-cassandra-datastax-while), [answer 2](http://stackoverflow.com/questions/33194860/cassandra-coordinator-node-timed-out-waiting-for-replica-nodes-responses?noredirect=1&lq=1) – Michal Mar 20 '17 at 13:53

1 Answer


From https://issues.apache.org/jira/browse/CASSANDRA-9328:

> There are cases where, under contention, the coordinator loses track of whether the value it submitted to Paxos might be applied or not (see CASSANDRA-6013). At that point we can't do anything other than answer "sorry, I don't know". And since a WriteTimeoutException already means "I don't know", we throw it in that case too, even though it's not a proper timeout per se.
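One practical consequence is that a WriteTimeout on an LWT write is best treated as an unknown outcome rather than a failure. A common way to resolve it is to read the row back at SERIAL or LOCAL_SERIAL consistency, which completes any in-progress Paxos round before returning. The sketch below is driver-agnostic and assumes an `INSERT ... IF NOT EXISTS`-style claim where each writer proposes a value unique to itself (for example a node id). `AmbiguousCasTimeout`, `try_cas`, `serial_read` and `cas_with_readback` are hypothetical names; with the DataStax driver you would catch `cassandra.WriteTimeout` instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AmbiguousCasTimeout(Exception):
    """Hypothetical stand-in for the driver's WriteTimeout on an LWT write."""


@dataclass
class CasResult:
    applied: bool           # True if our proposed value is the one now stored
    resolved_by_read: bool  # True if a SERIAL read-back was needed to decide


def cas_with_readback(
    try_cas: Callable[[], bool],
    serial_read: Callable[[], Optional[str]],
    my_value: str,
    max_attempts: int = 3,
) -> CasResult:
    """Attempt a conditional claim and resolve 'I don't know' timeouts.

    try_cas()     -> True if applied, False if the condition failed; raises
                     AmbiguousCasTimeout when the coordinator cannot tell.
    serial_read() -> current value read at (LOCAL_)SERIAL, or None if absent.
    my_value      -> assumed unique to this writer, so seeing it proves we won.
    """
    timed_out_before = False
    for _ in range(max_attempts):
        try:
            applied = try_cas()
        except AmbiguousCasTimeout:
            timed_out_before = True
            current = serial_read()
            if current is not None:
                # Some value is committed: it is either ours or a competitor's.
                return CasResult(applied=current == my_value, resolved_by_read=True)
            continue  # Nothing visible yet: assume retrying the same claim is safe.
        if applied or not timed_out_before:
            return CasResult(applied=applied, resolved_by_read=False)
        # Condition failed after an earlier timeout: our own earlier proposal
        # may be what blocks us now, so check who actually holds the row.
        return CasResult(applied=serial_read() == my_value, resolved_by_read=True)
    raise AmbiguousCasTimeout(f"outcome still unknown after {max_attempts} attempts")
```

Injecting the two callables keeps the resolution logic separate from the driver, so it can be exercised with fakes; in a real service they would wrap the conditional INSERT (with a `serial_consistency_level` of LOCAL_SERIAL) and a SELECT executed at LOCAL_SERIAL consistency.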

Michal