Query in DataStax

Question

I have 2 questions related to DataStax queries:

I have a installed DataStax Enterprise 4.6 on 3 nodes of exactly the same configuration with regards to CPU,RAM,Storage etc. I then created a keyspace with RF=3, created a CF within the keyspace and inserted about 10 million rows in it. Now when I login to Node1 and execute a count query, it returns about 1.5 million in about 1mt 15 secs. But when I login to Node2 and execute the exact same query, it take about 1mt 35 secs. Similarly, when I login to Node3 and execute, it takes about 1mt 20 secs. Why is there a difference in the query execution times on the 3 nodes?
I shut down DSE (service dse stop) on Node2 & Node3 and ran the query on Node1. Since all required data is available on Node1, it ran successfully and took 1mt 15sec. I then brought DSE up on Node2 and ran the query again. With tracing on, I see that data is being fetched from Node2 as well but the time taken to execute the query is more than 1mt 15sec. Should it not be less, since 2 nodes are being used? Similarly, when Node3 is also brought up and the query is executed, it takes more time compared to when 2 nodes are up. My understanding is that Cassandra/DataStax is linearly scalable.

Any help/pointers is much appreciated ..

"I have 2 questions" then ask 2 separate questions. – Raedwald Aug 07 '15 at 13:05 — Raedwald, Aug 07 '15 at 13:05

score 0 · Answer 1 · answered Aug 07 '15 at 12:23

Sounds like normal behavior to me. There is always some overhead when multiple nodes are coordinating and interacting with each other, and things are not necessarily going to behave in a perfectly symmetric way.

Even if all the data is local, there's still some interaction with the other nodes going on, and some of that will be non deterministic in time. You have network latencies that vary, different queueing orders of things, variable seeks times on disks, etc.

When you take two of the nodes down, the remaining node knows that they are down and so it doesn't bother trying to do any reads or interactions with them. That's why that scenario is the fastest. As you bring the other nodes back online, the extra coordination with them will slow things down a little. That's the price you pay for redundancy.

The performance scales by not keeping a copy of the data on every node. You are using RF=3 and only have three nodes. If you added a fourth node, then not all the data would be on every node. Now you have added capacity since not every write goes to all nodes and different writes will hit a different set of machines.

score 0 · Answer 2 · edited May 23 '17 at 12:06

Your question is simple to answer. It is a matter of Consistency: You can tune your select queries with a Consistency of One, then C* does not need to check if your data (RF=3) across all the nodes matches up. In most use cases a Consistency of One for reads should be sufficient. As for the time differences: The machines are involved in many different things beside serving queries. So normal behaviour to have different response times per node. There is a similar question/answer here : How do I set the consistency level of an individual CQL query in CQL3? Basically go and play with consistency and see how response times change.

Query in DataStax

2 Answers2