Why are super columns in Cassandra no longer favoured?

Question

I have read in the latest release that super columns are not desirable due to "performance issues", but no where is this explained.

Then I read articles such as this one that give wonderful indexing patterns using super columns.

This leave me with no idea of what is currently the best way to do indexing in Cassandra.

What are the performance issues of super columns?
Where can I find current best practices for indexing?

That is an excellent question. I think this ebay tech blog have a nice and low tech (no much tech details) overview of an optimized architecture. http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/ However, if you are into the real stuff, better read each changelog and roadmap so you get a better feel where and what are the problems and how are they being solved. It is too much reading and it would be nice if it could be systemized somewhere, but I also can't find much on the internet. — Nikola Yovchev, Aug 11 '12 at 14:03

score 34 · Accepted Answer · edited Nov 08 '17 at 18:38

Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to deserialze all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.

In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.

The easiest way to make use of composite columns is to take advantage of the abstraction that CQL 3 provides. Consider the following schema:

CREATE TABLE messages(
    username text,
    sent_at timestamp,
    message text,
    sender text,
    PRIMARY KEY(username, sent_at)
);

Username here is the row key, but we've used a PRIMARY KEY definition which creates a grouping of row key and the sent_at column. This is important as it has the effect of indexing that attribute.

INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:42:15', 'Hi', 'alice');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('alice', '2012-08-01 11:42:37', 'Hi yourself', 'bob');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:43:00', 'What are you doing later?', 'alice');
INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:47:14', 'Bob?', 'alice');

Behind the scenes Cassandra will store the above inserted data something like this:

alice: (2012-08-01 11:42:37,message): Hi yourself, (2012-08-01 11:42:37,sender): bob
bob:   (2012-08-01 11:42:15,message): Hi,          (2012-08-01 11:42:15,sender): alice, (2012-08-01 11:43:00,message): What are you doing later?, (2012-08-01 11:43:00,sender): alice (2012-08-01 11:47:14,message): Bob?, (2012-08-01 11:47:14,sender): alice

But using CQL 3, we can query the "row" using a sent_at predicate, and get back a tabular result set.

SELECT * FROM messages WHERE username = 'bob' AND sent_at > '2012-08-01';
 username | sent_at                  | message                   | sender
----------+--------------------------+---------------------------+--------
      bob | 2012-08-01 11:43:00+0000 | What are you doing later? |  alice
      bob | 2012-08-01 11:47:14+0000 |                      Bob? |  alice

Thanks! Talking of composite keys, is Cassandra able to efficiently perform range queries on each column? SELECT * FROM somewhere WHERE a > 3 AND a <= 12 AND b IN (1, 3, 6) AND c > 17 etc., assuming the key is a, b, c. — IamIC, Aug 14 '12 at 01:00
Or is a composite column name (multi-component) better for this case? — IamIC, Aug 14 '12 at 01:16
could you look at this please: http://stackoverflow.com/questions/11978386/cassandra-1-1-storage-engine — IamIC, Aug 16 '12 at 00:03
Interesting. I want to use Primary Key for a GUID for direct access, but this composite key seems like something I would like to use as well. Will it work the same with a simple INDEX and not primary key? — Maurice Klimek, May 27 '21 at 09:58

Why are super columns in Cassandra no longer favoured?

1 Answers1

Linked