Cassandra: select only latest rows

Question

I work with following table:

CREATE TABLE IF NOT EXISTS lp_registry.domain (
    ownerid text,
    name1st text,
    name2nd text,
    name3rd text,
    registrar text,
    registered timestamp,
    expiration timestamp,
    updated timestamp,
    technologies list<text>,
    techversions list<text>,
    ssl boolean,
    PRIMARY KEY (
        (name1st, name2nd, name3rd), 
        registrar, ownerid, registered, expiration, updated
    )
);

Table isn't updated only new rows are added. Everytime crawler checks domain, new row is added.

I am performing this select:

SELECT * FROM lp_registry.domain WHERE 
    registrar = 'REG-WEDOS' AND 
    ownerid = 'FORPSI-JAF-S497436' 
ALLOW FILTERING;

But what I want in the result are only the rows with latest 'updated' value for each unique "name3rd.name2nd.name1st".

If I were in a standard SQL database, I would use nested select with MAX or GROUP BY. However, this is not supported by Cassandra (MAX(), DISTINCT and group by in Cassandra). But what I should do in CQL?

score 3 · Answer 1 · edited May 23 '17 at 12:15

3

The whole schema should be modified. The SELECT you're doing, apparently an important one from your application point of view should not require ALLOW FILTERING: you should denormalize your data and create a table where registrar and ownerid are partition keys.

In that denormalized structure, updated should be a partition key, sorted with DESC. The query would then be

SELECT * FROM lp_registry.domain WHERE registrar='XXX' AND ownerid='YYY' LIMIT 10;

As you're saying that rows are inserted but never updated, it should not be complicated in your application to insert new data in, if necessary, more than one denormalized tables.

Andy's answer provides more details and an example for your table structure.

edited May 23 '17 at 12:15

Community

1
1

answered Jan 02 '16 at 12:53

Cedric H.

7,980
10
55
82

Thank you very much for your reply and advice. Sorry for my question, but I can't see the trick which gets me only latest rows for unique domain names. The example will select 10 lately updated domains for given registrar and owner. But how will I achieve their uniquecity? – Michal Jan 02 '16 at 13:05
@Michal I'm not sure I understand your question/comment but I'll try to update my answer later today. – Cedric H. Jan 02 '16 at 13:20

score 3 · Accepted Answer · edited May 23 '17 at 11:59

Extending onto Cedric's answer (which is great advice and would consider that as the answer to accept) you would get a table structure roughly like:

CREATE TABLE IF NOT EXISTS lp_registry.domain (
    ownerid text,
    name1st text,
    name2nd text,
    name3rd text,
    registrar text,
    registered timestamp,
    expiration timestamp,
    updated timestamp,
    technologies list<text>,
    techversions list<text>,
    ssl boolean,
    PRIMARY KEY ((registrar, ownerid), updated, name1st, name2nd, name3rd)
) WITH CLUSTERING ORDER BY (updated desc);

When selecting data it will return rows with the most recent updated values within the partition for the registrar and ownerid you are querying.

This query would be incredibly fast because your data will be organized on disk by registrar, owner id with rows in order by updated descending.

This is a key concept with cassandra in that your data is organized based on how you query it. You lose flexibility in your queries, but you can feel comfortable that you are going to get great performance because you are retrieving data as it is organized. This is why denormalizing your data based on your queries is vital.

Where things become complicated is if you wanted to retrieve the most recently updated of all data. That problem is not easily solvable with cassandra unless everything shares the same partition which has its own set of problems (example strategy using a 'dummy' partition key).

Thank you Andy. So it is impossible to do something like this in cassandra http://stackoverflow.com/questions/22889722/mysql-select-distinct-but-latest-row ? — Michal, Jan 03 '16 at 10:34
Probably not without putting everything in the same partition. Alternatively you could use something like SparkSQL with the spark cassandra connector to make more flexible queries. — Andy Tolbert, Jan 03 '16 at 20:17

Cassandra: select only latest rows

2 Answers2