
I was searching for pagination in Cassandra and found this perfect topic here: Results pagination in Cassandra (CQL), with an answer accepted by the majority of people. But I want to do the same thing on multiple computers. I'll provide an example...

The problem

Let's say I have three computers that are connected to the same Cassandra DB. Each computer wants to take a few rows from the following table:

CREATE TABLE IF NOT EXISTS lp_webmap.page (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    location text,
    title text,
    rank float,
    updated timestamp,
    PRIMARY KEY (
        (domain_name1st, domain_name2nd, domain_name3rd), location
    )
);

Every computer takes a few rows and performs time-consuming calculations on them. For a fixed partition key (domain_name1st, domain_name2nd, domain_name3rd) and varying clustering key (location), there can still be thousands of results.
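For illustration, here is a minimal sketch of how one computer might page through a single partition, assuming the DataStax Python driver, a local contact point, and purely hypothetical domain-name values:

# Sketch only: stream one partition of lp_webmap.page in pages of 100 rows.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])              # hypothetical contact point
session = cluster.connect('lp_webmap')

query = SimpleStatement(
    "SELECT location, title, rank FROM page "
    "WHERE domain_name1st = %s AND domain_name2nd = %s AND domain_name3rd = %s",
    fetch_size=100)                           # rows fetched per page

# The driver fetches further pages transparently while iterating.
for row in session.execute(query, ('com', 'example', 'www')):   # hypothetical domain parts
    pass  # the time-consuming calculation for this row would go here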

And now the problem comes... how can I quickly lock the rows that computer1 is working on, so that the other computers don't take them?

Unusable solution

In standard SQL I would use something like this:

CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    page_from int,
    page_count int,
    locked timestamp,
    PRIMARY KEY (
        (domain_name1st, domain_name2nd, domain_name3rd), locked, page_from
    )
) WITH CLUSTERING ORDER BY (locked DESC);

This would allow me to do the following (see the sketch after this list):

  • Select the first 10 pages on computer 1 and lock them (page_from=1, page_count=10)
  • Quickly check the locks on the other two machines and get unused pages for calculations
  • Take and lock a bigger number of pages on faster computers
  • Delete all locks for a given partition key after all pages are processed
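A rough sketch of that workflow in Python, assuming the DataStax Python driver and an already-connected session; the page-picking strategy and timestamps here are only illustrative:

# Sketch only: the lock workflow described in the list above, against lp_registry.page_lock.
from datetime import datetime, timezone

def claim_pages(session, d1, d2, d3, page_count=10):
    # See which page ranges other computers have already locked for this partition.
    locked = session.execute(
        "SELECT page_from, page_count FROM lp_registry.page_lock "
        "WHERE domain_name1st = %s AND domain_name2nd = %s AND domain_name3rd = %s",
        (d1, d2, d3))
    taken = [(row.page_from, row.page_count) for row in locked]

    # Naive strategy: start right after the highest locked page (1 if nothing is locked yet).
    page_from = max((pf + pc for pf, pc in taken), default=1)

    # Record the lock so the other computers skip this range.
    session.execute(
        "INSERT INTO lp_registry.page_lock "
        "(domain_name1st, domain_name2nd, domain_name3rd, page_from, page_count, locked) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (d1, d2, d3, page_from, page_count, datetime.now(timezone.utc)))
    return page_from, page_count

def release_partition(session, d1, d2, d3):
    # Delete all locks for the partition once every page has been processed.
    session.execute(
        "DELETE FROM lp_registry.page_lock "
        "WHERE domain_name1st = %s AND domain_name2nd = %s AND domain_name3rd = %s",
        (d1, d2, d3))

Note that the check-then-insert above is not atomic, so two computers could still claim the same range between the SELECT and the INSERT; that gap is part of why locking like this is hard to get right in Cassandra.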

Question

However, I can't do LIMIT 20,10 in Cassandra, and I also can't use the approach from the linked answer, since I want to paginate from different computers. Is there any way I can paginate through these pages quickly?

  • I would not know how to do what you are asking. But have you considered alternative approaches? Like using one Cassandra client to query for the rows that need processing, putting them on a queue and having all clients consume rows from the queue? That way you can control the number of rows being processed and you get load-balancing between clients. – Ralf Feb 14 '16 at 10:32
  • I don't understand your requirement, can you explain it in more detail? Why is locking needed to paginate through results? – doanduyhai Feb 14 '16 at 11:12
  • @doanduyhai ... Why is locking needed? Because I don't want two different computers to process the same rows. – Michal Feb 14 '16 at 11:50
  • If you want to **partition** the processing of your table across different computers, what you can do is use the existing partition keys. For example, you can decide that *computer1* only processes a certain list of (domain_name1st, domain_name2nd, domain_name3rd) and *computer2* only processes a different combination of domain names. – doanduyhai Feb 14 '16 at 12:35
  • @doanduyhai then there would have to be extra logic that instructs each particular computer which partition keys to use and which not. Since those nodes (computers) are independent and I want to add an unspecified number of nodes in the future, I am afraid I won't be able to accomplish this. But thanks for the advice. – Michal Feb 14 '16 at 14:24
  • There is an easy solution for independence of computer partitioning: use consistent hashing. For example, each computer is assigned an order (1, 2, ... N). Then for each tuple of (domain_name1st, domain_name2nd, domain_name3rd), compute a hash of the 3 values; it will give you an integer. Then compute this hash value modulo N (N = total number of computers). If hash % N = 2, then the computer with number 2 is responsible for that partition. – doanduyhai Feb 14 '16 at 14:29
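A minimal sketch of the hash-modulo assignment from the last comment, in Python; md5 is chosen arbitrarily as a hash that is stable across machines (unlike the built-in hash()), and MY_INDEX / N are assumed to be configured on each computer:

# Sketch only: deterministic assignment of partitions to computers via hash % N.
import hashlib

def owner_of(domain_name1st, domain_name2nd, domain_name3rd, n_computers):
    # Hash the full partition key and map it onto one of the n_computers indices.
    key = '|'.join((domain_name1st, domain_name2nd, domain_name3rd)).encode('utf-8')
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % n_computers

# Hypothetical configuration: 3 computers in total, this one has index 2.
MY_INDEX, N = 2, 3
if owner_of('com', 'example', 'www', N) == MY_INDEX:   # hypothetical domain parts
    pass  # this computer is responsible for processing that partition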

0 Answers