  • Using Cassandra 2.2.8

My challenge is as follows. In my database we have a bunch of tables with millions of rows. Unfortunately, due to a loose design, the partition keys on a few tables have grown to gigabytes in size - this puts negative pressure on the system and causes issues like JVM out-of-memory errors and node crashes.

We need to redesign the partition keys on a few tables. We have data in these tables that we need to retain or migrate to the new tables.

I'm looking for a solution that lets me export data from a source table into a target table (i.e. one with redesigned composite partition keys); I hope this will spread the partitions in a more balanced manner.

I tried the COPY [tablename(column1, column2, ...)] command, but it probes a number of nodes and puts pressure on the system/heap, i.e. it affects the application. I'm seeking guidance on how best to address this challenge - thank you in advance for any help.

Asad

2 Answers


Since you have very big tables and COPY has already failed, you must export and import your data manually. To perform such a task you need to use the TOKEN function.

With some small client code you can write queries to perform a full table data extraction with something like:

SELECT * FROM mytable WHERE token(pk) >= MIN_TOKEN AND token(pk) < MIN_TOKEN + QUERY_INTERVAL;
SELECT * FROM mytable WHERE token(pk) >= MIN_TOKEN + QUERY_INTERVAL AND token(pk) < MIN_TOKEN + 2*QUERY_INTERVAL;
....
SELECT * FROM mytable WHERE token(pk) >= MAX_TOKEN - QUERY_INTERVAL AND token(pk) <= MAX_TOKEN;

where MIN_TOKEN and MAX_TOKEN are the constant minimum and maximum token values of your cluster's partitioner, and QUERY_INTERVAL is the range window you want to query. The bigger the QUERY_INTERVAL, the more data you fetch in a single query (and the more likely you are to trigger a timeout).

Please note that Cassandra does not allow range operators (>, >=, <=, <) in the WHERE clause on partition key columns; the exception is when the TOKEN function is used.
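
A minimal sketch of that client loop, assuming the DataStax Java driver 3.x, the default Murmur3Partitioner (tokens cover the full signed 64-bit range), and a hypothetical table mytable in keyspace mykeyspace with partition key pk:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokenRangeScan {
    // Assumption: default Murmur3Partitioner, whose tokens span the full long range.
    private static final long MIN_TOKEN = Long.MIN_VALUE;
    private static final long MAX_TOKEN = Long.MAX_VALUE;
    // Width of each sub-range; shrink it if a single query gets close to the read timeout.
    private static final long QUERY_INTERVAL = 1L << 48;

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {   // hypothetical keyspace

            long start = MIN_TOKEN;
            while (true) {
                // Guard against long overflow on the last slice.
                long end = (start > MAX_TOKEN - QUERY_INTERVAL) ? MAX_TOKEN : start + QUERY_INTERVAL;

                // Both boundaries are inclusive here; the next slice starts at end + 1.
                ResultSet rs = session.execute(
                        "SELECT * FROM mytable WHERE token(pk) >= ? AND token(pk) <= ?",
                        start, end);
                for (Row row : rs) {
                    // Process each row here, e.g. re-insert it into the redesigned table.
                    System.out.println(row);
                }

                if (end == MAX_TOKEN) {
                    break;
                }
                start = end + 1;  // next slice begins right after the previous one
            }
        }
    }
}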


xmas79
  • Thank you @xmas79 for your reply. Please pardon my lack of experience if I'm missing something obvious. While combing the internet for answers, I learnt that sstableloader is a more efficient option for loading massive volumes of data - in particular the link [http://techblogsearch.com/a/bulk-loading-external-data-to-cassandra.html] suggests that if I create a pseudo network, I may be able to migrate data from one table to another using sstableloader. I'm looking for a working example to see how I can make it work! Is it worth giving a try, or am I going the wrong way? – Asad Apr 04 '17 at 00:11

COPY just imports/exports data to/from a file. If you want to redesign your data model, it will probably be better to implement a specialized tool for your task (a rough sketch follows below), which will:

  1. read data from the source table in portions (e.g. by token ranges, as @xmas79 described above)
  2. transform each portion to the new model
  3. write each portion to the new tables

Here is an example of how to read big tables by token ranges with Java and the DataStax driver.
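
For illustration, a rough sketch of such a read-transform-write tool, again assuming the DataStax Java driver 3.x; the keyspace, the tables old_table/new_table, their columns, and the bucketing transform are all placeholders you would replace with your own model:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TableMigrator {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {   // hypothetical keyspace

            // Step 3 target: the table with the redesigned composite partition key.
            PreparedStatement insert = session.prepare(
                    "INSERT INTO new_table (bucket, id, value) VALUES (?, ?, ?)");

            // Step 1: read one token sub-range of the source table
            // (loop over all sub-ranges as in the first answer; one slice shown here).
            ResultSet rs = session.execute(
                    "SELECT id, value FROM old_table WHERE token(id) >= ? AND token(id) < ?",
                    0L, 1L << 48);

            for (Row row : rs) {
                long id = row.getLong("id");
                String value = row.getString("value");

                // Step 2: derive the extra partition-key component (a bucket here)
                // so rows spread across many smaller partitions.
                int bucket = (int) Math.floorMod(id, 100L);

                // Step 3: write the transformed row to the new table.
                session.execute(insert.bind(bucket, id, value));
            }
        }
    }
}

The key design choice is step 2: whatever extra column you add to the partition key (a bucket, a time window, etc.) must also be known at read time by the application, so pick something your queries can compute.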

Mikhail Baksheev