How do partition keys work?

Question

I am new to Cassandra and I read that the primary key is the same thing as the partition key.

My question is simple, in this case:

CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);

Since the partition key is responsible for distributing data across your nodes, how will the data be distributed by user_name in this case?

You can read about the difference between the primary key and the partition key here: http://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra - grzesiekw, Feb 19 '16 at 20:39

Aaron · Accepted Answer · 2016-02-19T21:17:15.137

Actually, the PRIMARY KEY is not the same as the partition key. The partition key is a part of the PRIMARY KEY. In your table the PRIMARY KEY is a single column, so user_name is also the whole partition key. And yes, it is the part which determines how a row is distributed across the cluster.

how will the data be distributed by username in this case?

If I CREATE your table, insert some values, and query it, I can get a bit of a window into the distribution process by using the token function in my SELECT:

> SELECT token(user_name), user_name FROM user2;

 system.token(user_name) | user_name
-------------------------+-----------
    -5077180869401877077 |   Patdard
    -4874582970682694928 |      Robo
     4639906948852899531 |      Bill
     4645660266327417866 |       Bob
     4877648712764681009 | Valentina
     5726383012007749221 |   Helcine
     7724711996172375448 |  Jebediah

(7 rows)

Let's assume that I have 5 nodes. In Cassandra, each node is responsible for a primary token range. Let's assume the following ranges:

1)  5534023222112865485 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to  1844674407370955161
5)  1844674407370955162 to  5534023222112865484

Note: The range boundaries were computed by running (Python 3 syntax; `//` keeps the arithmetic in exact integers):

python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'

This is also how MVP Robbie Strickland depicts it in Cassandra High Availability.

Cassandra takes the hashed token value of the partition key (user_name in this case) and uses that to determine which node the row should be distributed to. Given the hashed token values above and the ranges that I have listed, these are the nodes each user name should go to:

Node 1: Helcine, Jebediah
Node 3: Patdard, Robo
Node 5: Bill, Bob, Valentina
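
The lookup above can be sketched in a few lines of Python. This is only an illustration of the range check, not Cassandra's actual code: the `node_for_token` helper and the node numbering are assumptions that simply follow the five ranges and the token values listed earlier in this answer.

```python
from bisect import bisect_left

# Inclusive upper bound of each node's primary range, in ring order.
# Node 1 owns the wrap-around range, so its bound is the minimum token.
# (Node numbers follow the assumed ranges in this answer.)
RING = [
    (-9223372036854775808, 1),
    (-5534023222112865485, 2),
    (-1844674407370955162, 3),
    (1844674407370955161, 4),
    (5534023222112865484, 5),
]
UPPER_BOUNDS = [bound for bound, _ in RING]

def node_for_token(token):
    """Hypothetical helper: find the node whose range contains the token."""
    idx = bisect_left(UPPER_BOUNDS, token)
    # Tokens above the largest bound wrap around to the first node.
    return RING[idx % len(RING)][1]

# Token values copied from the SELECT token(user_name) output.
tokens = {
    "Patdard": -5077180869401877077,
    "Robo": -4874582970682694928,
    "Bill": 4639906948852899531,
    "Bob": 4645660266327417866,
    "Valentina": 4877648712764681009,
    "Helcine": 5726383012007749221,
    "Jebediah": 7724711996172375448,
}

placement = {}
for name, token in tokens.items():
    placement.setdefault(node_for_token(token), []).append(name)

for node in sorted(placement):
    print(f"Node {node}: {', '.join(placement[node])}")
```

Running it groups the seven user names under the same three nodes as the list above.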

Depending on your replication factor (RF), Cassandra may also place additional replicas of each row on other nodes.

grzesiekw · Answer 2 · 2016-02-20T07:17:12.143

You can check where your data will be placed with nodetool getendpoints.

Below is a simple example.

I'm using ccm here to create my cluster - https://github.com/pcmanus/ccm.

I will use your users table with the following keyspace configuration:

CREATE KEYSPACE test_user WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

So there will be 3 replicas.

First I create a cluster with 5 nodes:

> ccm create -v 3.2 -n 5 test

start them:

> ccm start

and check if my cluster is up and running:

> ccm status                                   

Cluster: 'test'
---------------
node1: UP
node3: UP
node2: UP
node5: UP
node4: UP

Now I can check where data will be placed with nodetool getendpoints:

> ccm node1 nodetool getendpoints test_user users john;    

127.0.0.1
127.0.0.2
127.0.0.3

'john' will be on 127.0.0.1, 127.0.0.2, 127.0.0.3.

> ccm node1 nodetool getendpoints test_user users tom; 

127.0.0.3
127.0.0.4
127.0.0.5

'tom' will be on 127.0.0.3, 127.0.0.4, 127.0.0.5.
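
Both results are consistent with how SimpleStrategy picks replicas: the node that owns the token, plus the next nodes clockwise around the ring until the replication factor is reached. Here is a rough Python sketch of that walk. It is not Cassandra's implementation, and it rests on assumptions that are not shown in the output above: one token per node, a ring order that matches the IP order, and the first address listed by getendpoints being the token's owner. The `simple_strategy_replicas` helper is a made-up name for illustration.

```python
# Assumption: one token per node, and ring order matches the IP order.
RING_ORDER = ["127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4", "127.0.0.5"]

def simple_strategy_replicas(primary, rf, ring=RING_ORDER):
    """Hypothetical helper: the primary owner plus the next rf - 1 nodes clockwise."""
    start = ring.index(primary)
    count = min(rf, len(ring))  # cannot place more replicas than there are nodes
    return [ring[(start + i) % len(ring)] for i in range(count)]

# Assumed primary owners, taken as the first address in each getendpoints result.
john_replicas = simple_strategy_replicas("127.0.0.1", rf=3)
tom_replicas = simple_strategy_replicas("127.0.0.3", rf=3)

# A key owned by the last node wraps around to the start of the ring.
wrap_replicas = simple_strategy_replicas("127.0.0.5", rf=3)

print(john_replicas)
print(tom_replicas)
print(wrap_replicas)
```

With a replication_factor of 3 on a 5-node ring, every row ends up on three consecutive nodes, which is why 'john' and 'tom' share 127.0.0.3.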
