1

I've been doing a lot of reading lately on Cassandra data modelling and best practices.

What escapes me is what the best practice is for choosing a partition key if I want an application to page through results via the token function.

My current problem is that I want to display 100 results per page in my application and be able to move on to the next 100 after.

From this post: https://stackoverflow.com/a/24953331/1224608 I was under the impression a partition key should be selected such that data spreads evenly across each node. That is, a partition key does not necessarily need to be unique.

However, if I'm using the token function to page through results, e.g.:

SELECT * FROM table WHERE token(partitionKey) > token('someKey') LIMIT 100;

That would mean that the number of results returned per query may not match the number of results I show on a page, since multiple rows may have the same token(partitionKey) value. Or worse, if the number of rows that share the partition key exceeds 100, I will miss results, because the next query's token comparison would skip past the rest of that partition.

The only way I could guarantee 100 results on every page (barring the last page) is if I were to make the partition key unique. I could then read the last value in my page and retrieve the next page with an almost identical query:

SELECT * FROM table WHERE token(partitionKey) > token('lastKeyOfCurrentPage') LIMIT 100;

But I'm not certain if it's good practice to have a unique partition key for a complex table.

Any help is greatly appreciated!

BurnerBoy
  • 93
  • 2
  • 9

2 Answers

2

But I'm not certain if it's good practice to have a unique partition key for a complex table.

How you choose your partition key depends on your requirements and your data model. If the partition key is the only key in the primary key, it has to be unique; otherwise writes with the same key will be upserts (the existing data gets overwritten). If you have a wide row (i.e. a clustering key), then making your partition key unique (a key that appears only once in the table) defeats the purpose of the wide row. In CQL, "wide rows" just means that there can be more than one row per partition; with a unique partition key there will be only one row per partition. It would be better if you could provide the schema.
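
To illustrate the difference, here is a minimal sketch using the DataStax Java driver; the `demo` keyspace, the table and column names, and the contact point are all made up for illustration:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UpsertDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        // Partition key only: writes with the same key are upserts, so the second
        // INSERT silently overwrites the first and a single row remains.
        session.execute("CREATE TABLE IF NOT EXISTS demo.single_key "
                + "(k text, v text, PRIMARY KEY (k))");
        session.execute("INSERT INTO demo.single_key (k, v) VALUES ('a', 'first')");
        session.execute("INSERT INTO demo.single_key (k, v) VALUES ('a', 'second')");
        System.out.println(session.execute(
                "SELECT * FROM demo.single_key WHERE k = 'a'").all().size()); // 1

        // Partition key plus clustering key: both rows are kept and live in the
        // same partition -- this is what a "wide row" means in CQL.
        session.execute("CREATE TABLE IF NOT EXISTS demo.wide_row "
                + "(k text, v text, PRIMARY KEY (k, v))");
        session.execute("INSERT INTO demo.wide_row (k, v) VALUES ('a', 'first')");
        session.execute("INSERT INTO demo.wide_row (k, v) VALUES ('a', 'second')");
        System.out.println(session.execute(
                "SELECT * FROM demo.wide_row WHERE k = 'a'").all().size()); // 2

        cluster.close();
    }
}
```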

Please follow the links below about pagination in Cassandra.

You do not need to use token() if you are using Cassandra 2.0+. Cassandra 2.0 has automatic paging: instead of building paging yourself with the token function, it is now a built-in feature.

Results pagination in Cassandra (CQL)

https://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/

Saving and reusing the paging state

You can use the PagingState object, which represents where you were in the result set when the last page was fetched.
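
A minimal sketch of saving and reusing the paging state with the DataStax Java driver 2.x; the contact point, keyspace, table name, and fetch size of 100 are made up for illustration:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagingStateExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // hypothetical keyspace

        // First page: the fetch size controls how many rows the driver pulls per page.
        Statement stmt = new SimpleStatement("SELECT * FROM my_table");
        stmt.setFetchSize(100);
        ResultSet rs = session.execute(stmt);

        // Serialize the paging state so it can be stored (e.g. handed back to the
        // client) and reused when the "next page" request comes in.
        PagingState pagingState = rs.getExecutionInfo().getPagingState();
        String savedState = pagingState.toString();

        // Render only the rows that are already buffered locally, i.e. the current
        // page, without triggering a fetch of the next one.
        for (int i = rs.getAvailableWithoutFetching(); i > 0; i--) {
            Row row = rs.one();
            // render the row ...
        }

        // Later (e.g. on the next request): re-run the same statement, restoring
        // the saved paging state to continue exactly where we left off.
        Statement next = new SimpleStatement("SELECT * FROM my_table");
        next.setFetchSize(100);
        next.setPagingState(PagingState.fromString(savedState));
        ResultSet nextPage = session.execute(next);

        cluster.close();
    }
}
```

Note that a paging state only lets you resume forward from where you stopped, and only with the same statement; for a "previous page" button you would have to keep the earlier paging states (or page keys) yourself.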

EDITED:

Please check the link below:

Paging Resultsets in Cassandra with compound primary keys - Missing out on rows

Chaity
  • 1,348
  • 13
  • 20
  • I do have a clustering key. As a really simplified version, the table definition is `CREATE TABLE table (unique_val text, resource_owner text, PRIMARY KEY (unique_val, resource_owner));` Due to some limitations I can't take advantage of the built-in paging. I also need to be able to page backward and forward, so I was using the token value as a lesser offset (though I don't need page-jumping in my requirement). – BurnerBoy Mar 05 '18 at 19:32
  • If `unique_val` is unique, and thus there is one row per partition, then the clustering key is unnecessary; you can keep `resource_owner` as metadata rather than as a key. I don't know your requirements so I can't suggest the data model, but in that case you can use token() and no row will be missed. What limitations do you have? Using token() causes some unnecessary complexity in the client code and inconvenience for developers that is otherwise handled by the driver. I have provided a link about the same issue (if you have a wide row); I think you can check this out. Check the edited answer. – Chaity Mar 06 '18 at 04:44
  • I checked out the link, but using `ALLOW FILTERING` is against my requirement. I also don't have much freedom in using the driver, as there is a layer I need to go through before hitting the Java driver, so unfortunately I can't rely on the built-in paging. My requirement is really just that I need to be able to page forward and backward across a result set and ensure that the data is stable for each page. Hence I was wondering about the partition key and whether or not it should be unique, since my idea was to use TOKEN() for paging. – BurnerBoy Mar 15 '18 at 22:34
0

I recently did a POC for a similar problem, so I'll quickly add it here.

First, there is a table with two fields; just for illustration we use only a few fields.

Say we insert a million rows into this table.

Along comes the product owner with a (rather strange) requirement that we need to list all the data as pages in the GUI. Assume, for illustration, a hundred entries split into ten pages of ten rows each.

  1. For this, we add a column called page_no to the table.
  2. Create a secondary index on this column.
  3. Then do a one-time update of this column with page numbers. Page number 10 will mean 10 contiguous rows updated with page_no set to 10.
  4. Since we can query on a secondary index, each page can be queried independently (a sketch follows below).
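
A rough sketch of those four steps with the Java driver; the keyspace, the table and its `id` primary key column, the contact point, and the ten-row page size are all made up for illustration:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PageNoIndexSketch {
    static final int PAGE_SIZE = 10; // ten rows per page, as in the illustration above

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo"); // hypothetical keyspace

        // 1. Add the page_no column to the existing table.
        session.execute("ALTER TABLE my_table ADD page_no int");

        // 2. Create a secondary index on it.
        session.execute("CREATE INDEX IF NOT EXISTS page_no_idx ON my_table (page_no)");

        // 3. One-time backfill: walk the table once and stamp every PAGE_SIZE
        //    contiguous rows with the same page number.
        int count = 0;
        int pageNo = 1;
        for (Row row : session.execute("SELECT id FROM my_table")) {
            session.execute("UPDATE my_table SET page_no = ? WHERE id = ?",
                    pageNo, row.getString("id"));
            if (++count % PAGE_SIZE == 0) {
                pageNo++;
            }
        }

        // 4. Each page can now be queried independently through the index.
        ResultSet page10 = session.execute("SELECT * FROM my_table WHERE page_no = ?", 10);

        cluster.close();
    }
}
```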

The code is self-explanatory and is here: https://github.com/alexcpn/testgo

Note: cautions about how to use secondary indexes properly abound, so please check them. In this use case I am hoping that I am using it properly. I have not tested with multiple clusters.

"In practice, this means indexing is most useful for returning tens, maybe hundreds of results. Bear this in mind when you next consider using a secondary index." From http://www.wentnet.com/blog/?p=77

Alex Punnen
  • 5,287
  • 3
  • 59
  • 71
  • this is very bad practice. Secondary indexes work well only if you provide a partition key as well - otherwise the query needs to hit all nodes looking for data. Plus, with such a page size you'll get a problem with high cardinality of values in the index, which will also lead to performance problems – Alex Ott Jun 24 '19 at 08:37
  • page size can be adjusted so that all page_no's fall on one node. In this case the partition key was chosen poorly or the data was modelled poorly, but this happens. Cannot use a partition key here. We have only four nodes, so it should not be that problematic. Using Cassandra as the store and Elastic for indexing - I found this question and some answers to it more interesting - https://stackoverflow.com/questions/27054954/elasticsearch-vs-cassandra-vs-elasticsearch-with-cassandra . Maybe that is the best way – Alex Punnen Jun 24 '19 at 08:57