Actually, the PRIMARY KEY is not the same as the partition key. The partition key is a part of the PRIMARY KEY, and yes, it is the part that determines how a row is distributed across the cluster.
how will the data be distributed by username in this case?
If I CREATE your table, insert some values, and query it, I can get a bit of a window into the distribution process by using the token function in my SELECT:
> SELECT token(user_name), user_name FROM user2;
system.token(user_name) | user_name
-------------------------+-----------
-5077180869401877077 | Patdard
-4874582970682694928 | Robo
4639906948852899531 | Bill
4645660266327417866 | Bob
4877648712764681009 | Valentina
5726383012007749221 | Helcine
7724711996172375448 | Jebediah
(7 rows)
Let's assume that I have 5 nodes. In Cassandra, each node is responsible for a primary token range. Suppose the ranges are as follows:
1) 5534023222112865485 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to 1844674407370955161
5) 1844674407370955162 to 5534023222112865484
Note: Ranges computed by running:
python -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'
The ranges are also depicted this way in MVP Robbie Strickland's Cassandra High Availability.
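To make the step from those five computed tokens to the five ranges explicit, here is a minimal Python sketch. It assumes one token per node (no vnodes) and that each node owns everything after the previous node's token up to and including its own token, with node 1 picking up the wrap-around. The names initial_tokens and primary_ranges are my own, not anything from Cassandra.

```python
NUM_NODES = 5  # assumption: a 5-node cluster with one token per node

# Same arithmetic as the one-liner: evenly spaced tokens starting at -2**63.
initial_tokens = [(2**64 // NUM_NODES) * i - 2**63 for i in range(NUM_NODES)]


def primary_ranges(tokens):
    """Return one inclusive (start, end) pair per node, in node order.

    Each node's range ends at its own token and starts one past the
    previous node's token. For the first node, tokens[i - 1] is the last
    token in the list, which gives the wrap-around range.
    """
    ranges = []
    for i, end in enumerate(tokens):
        start = tokens[i - 1] + 1
        ranges.append((start, end))
    return ranges


for node, (start, end) in enumerate(primary_ranges(initial_tokens), start=1):
    print("{}) {} to {}".format(node, start, end))
```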
Cassandra takes the hashed token value of the partition key (user_name in this case) and uses it to determine which node the row should be distributed to. Given the hashed token values above and the ranges that I have listed out, these are the nodes that each user name should go to:
Node 1: Helcine, Jebediah
Node 3: Patdard, Robo
Node 5: Bill, Bob, Valentina
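If you want to check that mapping yourself, the lookup can be sketched in a few lines of Python. This is only an illustration of the idea under my assumptions (5 nodes, one token per node, ranges as listed above), not how Cassandra implements it internally. The token values are copied from the query output; node_for_token is a hypothetical helper name.

```python
from bisect import bisect_left

NUM_NODES = 5  # assumption: 5 nodes, one token each

# Each node's token marks the inclusive upper end of its primary range.
initial_tokens = [(2**64 // NUM_NODES) * i - 2**63 for i in range(NUM_NODES)]

# Token values taken from the SELECT token(user_name) output.
user_tokens = {
    "Patdard": -5077180869401877077,
    "Robo": -4874582970682694928,
    "Bill": 4639906948852899531,
    "Bob": 4645660266327417866,
    "Valentina": 4877648712764681009,
    "Helcine": 5726383012007749221,
    "Jebediah": 7724711996172375448,
}


def node_for_token(token):
    """Return the 1-based number of the node whose range contains token."""
    # Find the first node token that is >= the row's token. A token larger
    # than every node token wraps around to node 1.
    index = bisect_left(initial_tokens, token)
    return (index % NUM_NODES) + 1


for name, token in user_tokens.items():
    print("{} -> node {}".format(name, node_for_token(token)))
```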
Depending on your replication factor (RF), Cassandra may also place additional replicas of each row on other nodes.
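As a rough sketch of what that looks like, assume the keyspace uses SimpleStrategy, which places the additional replicas on the next nodes clockwise around the ring. NetworkTopologyStrategy also takes racks and data centers into account, so this does not apply there. I am also assuming that the node numbers above follow ring order, and replica_nodes is a made-up helper name.

```python
def replica_nodes(primary_node, replication_factor, num_nodes=5):
    """Return the 1-based node numbers holding a row's replicas.

    Assumes SimpleStrategy-style placement: the first replica is on the
    node that owns the token, and each further replica is on the next
    node around the ring, wrapping from the last node back to node 1.
    """
    # There cannot be more distinct replicas than there are nodes.
    count = min(replication_factor, num_nodes)
    return [((primary_node - 1 + i) % num_nodes) + 1 for i in range(count)]


# With RF=3, a row owned by node 5 (Bill, Bob, Valentina) is also
# stored on nodes 1 and 2.
print(replica_nodes(5, 3))
```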