Cassandra cluster key usage

Question

I've been banging my head against this, but, frankly speaking, I just can't work it out.

I have a column family that holds jobs for a rather large group of actors. It is a central job management and scheduling table that must be distributed and available throughout the whole cluster, and it may even have to span datacenter boundaries in the near future.

Each job executor actor system (the ones that actually execute the jobs) is installed alongside one Cassandra node, that is, on the same host. Of course there is actually a master actor that pulls the jobs and distributes them to the actor agents, but that has nothing to do with my question.

There are also some actor systems that can create jobs in the central job table to be executed by other actors or even other actor systems, but usually the jobs are loaded in batches or manually through a web interface.

An actor that is to execute a job always queries only its local Cassandra node. When it has finished, it updates the job table to mark the job as finished. Under normal circumstances, this write should also only touch job records for which its local Cassandra node is authoritative.

Now, sometimes it may happen that an actor system on a given host has nothing to do. In this case it should indeed get jobs from other nodes too, but of course it will still only talk to its local Cassandra node. I know this works and it doesn't bother me a bit.

What keeps me up at night is this:

How would I create a compound key that makes a Cassandra node locally authoritative for the job entries of its local actor system (and thereby its job execution actors), without splitting the job table into multiple column families or the like?

In other words: how can I create a compound key that makes sure that a) jobs are evenly distributed across my cluster, b) a local query on the job table only returns jobs for which this Cassandra node is authoritative, and c) my distributed agent system can still fetch jobs from other nodes when it has no jobs of its own to execute?

A last word on c) above: I do not want to run two queries when there is no local job. It should still be only one!

Any hints on this?

This is the general structure of the job table so far:

ClusterKey    UUID: Primary Key
JobScope    String: HOST / GLOBAL / SERVICE / CHANNEL
JobIdentifier    String: Web-Crawler, Twitter
Description    String: 
URL    String:
JobType    String: FETCH / CLEAN / PARSE /
Job    String: Definition of the job
AdditionalData    Collection: 
JobStatus      String: NEW / WORKING / FINISHED 
User    String: 
ValidFrom    Timestamp: 
ValidUntill    Collection:

I'm still in the process of setting everything up, so no queries are defined yet. But an actor will pull jobs out of this table, set their status, and so on.

Can you edit your question with your schema (`CREATE TABLE`) and query statements? That will make it much easier to see what you're trying to do. — Aaron, Mar 14 '15 at 13:12

Chris Shain · Answer 1 · 2015-03-14T13:51:03.870

Cassandra has no way of "pinning" a key to a node, if that's what you are after.
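To see why pinning isn't possible, it helps to picture how a row gets its home: the partition key is hashed to a token, and the token's position on the ring decides which node owns the row. Here is a toy sketch of that idea. The node names and token values are made up, and MD5 is only a stand-in for whatever partitioner the cluster actually uses.

```python
import bisect
import hashlib

# Toy ring: each node owns the token range that ends at its token.
# Node names and token values are hypothetical, chosen for illustration only.
RING = [
    (2**126, "node-a"),
    (2**127, "node-b"),
    (3 * 2**126, "node-c"),
    (2**128, "node-d"),
]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash the partition key to a token (MD5 as a stand-in partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def primary_replica(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    index = bisect.bisect_left(TOKENS, token_for(partition_key))
    return RING[index % len(RING)][1]  # wrap around the ring if needed
```

The client never appears in this calculation. The owner of a job row follows from the key's hash alone, so a random UUID key spreads jobs evenly but cannot also express "this job belongs to the node I am talking to".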

If I were you, I'd stop worrying about whether my local node was authoritative for some set of data, and start leveraging the built-in consistency controls in Cassandra for managing the set of nodes that you read from or write to.

There is lots of information on read and write consistency here. Using the right consistency level will ensure that your application scales well while staying logically correct: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
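As a rough illustration of what "the right consistency" means in numbers: a read is guaranteed to overlap the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helper names below are made up for this sketch and are not part of any driver.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, `quorum(3)` is 2, so QUORUM reads combined with QUORUM writes overlap (2 + 2 > 3), while ONE reads combined with ONE writes do not (1 + 1 is not greater than 3). In the second case an actor could read a job as NEW after another actor has already marked it WORKING.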

Another item worth mentioning is atomic "compare and swap", also known as lightweight transactions. Let's say you want to ensure that a given job is only performed once. You could add a field indicating whether the job has been "picked up", then query on that field (where picked_up = 0) and simultaneously (and atomically) update the field to indicate that you are "picking up" that work. That way no other actors will pick it up again.

Info on lightweight transactions here: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
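A minimal sketch of what the "pick up" step could look like with a conditional update, assuming the DataStax Python driver and a `session` that is already connected to the right keyspace. The table name `jobs` and the columns `cluster_key`, `job_status` and `claimed_by` are assumptions loosely based on the structure in the question (it reuses JobStatus instead of a separate picked_up flag), not a confirmed schema.

```python
# Hypothetical table and column names; adapt them to the real schema.
CLAIM_JOB_CQL = (
    "UPDATE jobs SET job_status = 'WORKING', claimed_by = %s "
    "WHERE cluster_key = %s IF job_status = 'NEW'"
)


def try_claim_job(session, job_id, worker_name):
    """Try to claim one job; return True only if this worker won the race.

    The IF clause makes the update a lightweight transaction: it is applied
    only when the job is still NEW at the moment of the write.
    """
    result = session.execute(CLAIM_JOB_CQL, (worker_name, job_id))
    return result.was_applied
```

An actor would first select candidate jobs with an ordinary query and then call `try_claim_job` for each one, skipping any job for which it returns False because another actor got there first. Lightweight transactions cost extra round trips between replicas, so this is best kept to the claim step only.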
