5

I was using an Astyanax connection pool defined like this:

String ipSeeds = "LOAD_BALANCER_HOST:9160";
conPool.setSeeds(ipSeeds)
       .setDiscoveryType(NodeDiscoveryType.TOKEN_AWARE)
       .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE);

However, my cluster has 4 nodes and I have 8 client machines connecting to it. LOAD_BALANCER_HOST forwards requests to one of my four nodes.

On a client node, I have:

$ netstat -an | grep 9160 | awk '{print $5}' | sort | uniq -c
    235 node1:9160
    680 node2:9160
      4 node3:9160
      4 node4:9160

So although the ConnectionPoolType is TOKEN_AWARE, my client seems to be connecting mainly to node2, sometimes to node1, but almost never to nodes 3 and 4.
The question is: why is this happening? Shouldn't a token-aware connection pool query the ring for the node list and connect to all active nodes using a round-robin algorithm?
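
For reference, here is roughly where those settings live in a full AstyanaxContext setup. This is a sketch from memory rather than my exact code: the cluster, keyspace, and pool names, the port, and the pool size are placeholders, and in stock Astyanax the discovery and pool types are set on AstyanaxConfigurationImpl rather than on the pool configuration itself.

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class CassandraClientSetup {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("MyCluster")     // placeholder cluster name
            .forKeyspace("MyKeyspace")   // placeholder keyspace name
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.TOKEN_AWARE)
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                .setPort(9160)
                .setMaxConnsPerHost(10)  // placeholder pool size
                .setSeeds("LOAD_BALANCER_HOST:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();      // hands back the Keyspace client
    }
}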

mvallebr
  • 2,388
  • 21
  • 36
  • What replication factor have you configured for your keyspace? Which replication strategy did you choose? Are you sure your clients are accessing wide range of row keys, or are they mainly focusing on one or two in your test? – William Price Apr 26 '14 at 18:00
  • I am using replication factor 2 and the network topology replication strategy. – mvallebr Apr 28 '14 at 18:13
  • I am sure I am using a wide range of row keys, but where a row is stored should not have anything to do with the connections, right? Or am I wrong about that? – mvallebr Apr 28 '14 at 18:14
  • A "token aware" strategy should attempt to contact only the nodes that actually store the data. You're using RF=2, so for any given token (hash of the row key) there are two(2) out of your four(4) nodes that hold the data for that token. For any given request, a properly operating token-aware strategy on your client would only contact one of those two nodes. Token-aware is not the same thing as a simple round-robin. – William Price May 02 '14 at 19:19
  • Earlier I didn't ask what kind of partitioner you were using; this is also important. Out of the box, Cassandra uses a partitioner that stores data randomly around your ring based on the hash of your row key. If you are indeed accessing a wide range of row keys, I'd normally expect the load to be balanced across the cluster. However, **IF** you configured an order-preserving partitioner then your rows won't be evenly distributed and you'll get hot spots. – William Price May 02 '14 at 19:23

1 Answer

2

William Price is totally right: the fact that you're using a TokenAwarePolicy, and possibly a default partitioner, means that, first, your data will be stored unevenly across your nodes, and second, when querying, the LoadBalancingPolicy makes your driver remember the correct nodes to ask for.
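
To make that concrete, here is a toy model (plain Java, not Astyanax internals) of what "token aware" means with RF=2 on a four-node ring: each row key hashes to a primary owner plus one more replica, and a token-aware client only talks to those replicas for that key, so connection counts follow where your keys hash instead of rotating round-robin. The ring layout and the hash below are simplifications standing in for the real partitioner and replication strategy.

import java.util.ArrayList;
import java.util.List;

public class TokenAwareSketch {
    static final String[] RING = {"node1", "node2", "node3", "node4"};
    static final int RF = 2; // replication factor from the comments above

    // Toy placement: primary owner by hash of the key, then RF-1 neighbours clockwise.
    static List<String> replicasFor(String rowKey) {
        int primary = Math.floorMod(rowKey.hashCode(), RING.length); // stand-in for the partitioner
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < RF; i++) {
            replicas.add(RING[(primary + i) % RING.length]);
        }
        return replicas;
    }

    public static void main(String[] args) {
        // A token-aware client would only contact the nodes printed for each key.
        for (String key : new String[]{"user:1", "user:2", "user:3"}) {
            System.out.println(key + " -> " + replicasFor(key));
        }
    }
}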

You can improve your cluster's performance by using a different, or maybe even a custom, partitioner to distribute your data equally. To query nodes randomly, use either a RoundRobinPolicy or a DCAwareRoundRobinPolicy.

The latter, of course, needs the definition of data centers in your keyspace.
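
Those two policy names come from the DataStax Java driver; since the question uses Astyanax, the rough equivalent there is to switch the pool type to ROUND_ROBIN. A sketch, reusing the conPool builder from the question and changing only the discovery and pool types:

// Rotate requests over all live nodes instead of routing to each token's replicas.
conPool.setSeeds(ipSeeds)
       .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)      // discover every ring member
       .setConnectionPoolType(ConnectionPoolType.ROUND_ROBIN); // spread connections evenly

Keep in mind this trades locality for even connection counts: with a round-robin pool, the coordinator you hit will often have to forward the request to a replica anyway.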

Without any further information, I would suggest just changing the partitioner, as a token-aware load balancing policy is usually a good idea. The main load will end up on those nodes in the end anyway; the token-aware policy just gets you to the right coordinator more quickly.

Daniel Schulz
  • 554
  • 2
  • 9