I am new to Cassandra and I read that the primary key is the same thing as the partition key.

My question is simple, in this case:

CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);

As the partition key is responsible for data distribution across your nodes, how will the data be distributed by username in this case?

Aaron
farhawa
  • You can read about difference between primary key and partition key here: http://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra – grzesiekw Feb 19 '16 at 20:39
  • Please read my question again – farhawa Feb 19 '16 at 20:41

2 Answers

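Actually, the PRIMARY KEY is not always the same as the partition key. The partition key is a part of the PRIMARY KEY (in your table the PRIMARY KEY is the single column user_name, so that column is also the entire partition key). And yes, it is the part which determines how a row is distributed across the cluster.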

how will the data be distributed by username in this case?

If I CREATE your table, insert some values, and query it, I can get a bit of a window into the distribution process by using the token function in my SELECT:

> SELECT token(user_name), user_name FROM user2;

 system.token(user_name) | user_name
-------------------------+-----------
    -5077180869401877077 |   Patdard
    -4874582970682694928 |      Robo
     4639906948852899531 |      Bill
     4645660266327417866 |       Bob
     4877648712764681009 | Valentina
     5726383012007749221 |   Helcine
     7724711996172375448 |  Jebediah

(7 rows)

Let's assume that I have 5 nodes. In Cassandra each node is responsible for a primary token range. Let's assume the following:

1)  5534023222112865485 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to  1844674407370955161
5)  1844674407370955162 to  5534023222112865484

Note: Ranges computed by running:

python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'

Also depicted this way in MVP Robbie Strickland's Cassandra High Availability.

Cassandra takes the hashed token value of the partition key (user_name in this case) and uses that to determine which node the row should be distributed to. Given the hashed token values above, and the ranges that I have listed out, these are the nodes which each user name should go to:

Node 1: Helcine, Jebediah
Node 3: Patdard, Robo
Node 5: Bill, Bob, Valentina
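
The lookup above can be sketched in a few lines of Python. This is only an illustration of the range arithmetic, not how Cassandra is implemented internally; the function name node_for_token is hypothetical, and it assumes the five evenly sized ranges listed earlier with one token range per node.

```python
from bisect import bisect_left

NUM_NODES = 5
MIN_TOKEN = -2**63
STEP = 2**64 // NUM_NODES

# Inclusive upper bound of each node's primary range (node 1 first).
# Node 1 owns the wrap-around range that ends at the minimum token.
range_ends = [MIN_TOKEN + STEP * i for i in range(NUM_NODES)]


def node_for_token(token):
    """Return the 1-based node number whose primary range contains token."""
    # First range whose upper bound is >= token; tokens above the last
    # bound wrap around the ring to node 1.
    idx = bisect_left(range_ends, token)
    return (idx % NUM_NODES) + 1


# Token values copied from the SELECT output above.
tokens = {
    "Patdard": -5077180869401877077,
    "Robo": -4874582970682694928,
    "Bill": 4639906948852899531,
    "Bob": 4645660266327417866,
    "Valentina": 4877648712764681009,
    "Helcine": 5726383012007749221,
    "Jebediah": 7724711996172375448,
}

placement = {}
for name, token in tokens.items():
    placement.setdefault(node_for_token(token), []).append(name)

for node in sorted(placement):
    print(f"Node {node}: {', '.join(placement[node])}")
```

Running it should reproduce the node assignments listed above.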

Depending on your replication factor (RF), Cassandra may also place additional replicas of each row on other nodes.
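
As a rough sketch of what that means, assume SimpleStrategy and a single token range per node (no vnodes). Under those assumptions the extra replicas go to the next nodes clockwise around the ring. The helper replicas_for_node below is hypothetical and only models that walk over a 5-node ring numbered 1 to 5:

```python
def replicas_for_node(primary, rf=3, num_nodes=5):
    """Return the node numbers holding a row whose primary replica is `primary`.

    Assumes SimpleStrategy-style placement: the primary node first, then the
    next nodes clockwise around the ring, wrapping from the last node to node 1.
    """
    count = min(rf, num_nodes)  # cannot have more replicas than nodes
    return [((primary - 1 + i) % num_nodes) + 1 for i in range(count)]


# Rows owned by node 5 (Bill, Bob, Valentina in the example) with RF=3:
print(replicas_for_node(5))  # [5, 1, 2]
# Rows owned by node 3 (Patdard, Robo) with RF=3:
print(replicas_for_node(3))  # [3, 4, 5]
```

With NetworkTopologyStrategy or vnodes the placement is more involved, so treat this only as a mental model.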

Aaron

You can check where your data will be placed with nodetool getendpoints.

Below is a simple example.

I'm using ccm here to create my cluster - https://github.com/pcmanus/ccm.

I will use your table users with the following keyspace configuration:

CREATE KEYSPACE test_user WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

So there will be 3 replicas.

First I create a cluster with 5 nodes:

> ccm create -v 3.2 -n 5 test

start them:

> ccm start

and check if my cluster is up and running:

> ccm status

Cluster: 'test'
---------------
node1: UP
node3: UP
node2: UP
node5: UP
node4: UP

Now I can check where data will be placed with nodetool getendpoints:

> ccm node1 nodetool getendpoints test_user users john

127.0.0.1
127.0.0.2
127.0.0.3

'john' will be on 127.0.0.1, 127.0.0.2, 127.0.0.3.

> ccm node1 nodetool getendpoints test_user users tom

127.0.0.3
127.0.0.4
127.0.0.5

'tom' will be on 127.0.0.3, 127.0.0.4, 127.0.0.5.

grzesiekw