Why Cassandra doesn't have secondary index?

Question

Cassandra is positioned as scalable and fast database. Why , I mean from technical details, above goals cannot be accomplished with secondary indexes?

score 2 · Accepted Answer · answered May 22 '19 at 13:41

2

Cassandra does indeed have secondary indexes. But secondary index usage doesn't work well with distributed databases, and it's because each node only holds a subset of the overall dataset.

I previously wrote an answer which discussed the underlying details of secondary index queries:

How do secondary indexes work in Cassandra?

While it should help give you some understanding of what's going on, that answer is written from the context of first querying by a partition key. This is an important distinction, as secondary index usage within a partition should perform well.

The problem is when querying only by a secondary index, that Cassandra cannot guarantee all of your data will be able to be served by a single node. When this happens, Cassandra designates a node as a coordinator, which in turn queries all other nodes for the specified indexed values.

Essentially, instead of performing sequential reads from a single node, secondary index usage forces Cassandra to perform random reads from all nodes. Now you don't have just disk seek time, but also network time complicating things.

The recommendation for Cassandra modeling, is to duplicate your data into new tables to support the desired query. This adds in some other complications with keeping data in-sync. But (when done correctly) it ensures that your queries can indeed be served by a single node. That's a tradeoff you need to make when building your model. You can have convenience or performance, but not both.

answered May 22 '19 at 13:41

Aaron

55,518
11
116
132

Thanks for the thorough answer! why random reads on every row? We could have rdbms-like secondary index , on every node(for every partition). Then it would be direct search by secondary index. Isn't it? – voipp May 22 '19 at 13:48
It doesn't have to look at every row, but it does have to look at the index storage on every node. The randomness comes into play because the indexing mechanism isn't sure if it'll find one, multiple, or *any* values on a specific node. But to be thorough, it has to look. The main problem is the uncertainty involved with it. When querying by partition key, Cassandra knows which node has the data, but it just can't discern that from a secondary index. – Aaron May 22 '19 at 13:57
Wanted to clarify one thing about cassandra: Suppose we have replication-factor as 2 . Thus some key "key1" will be copied from primary node to secondary node(replica-node for the key). During read operation, would cassandra load balance requests and retrieve key1 from both primary node and secondary one? – voipp May 23 '19 at 14:46
1

It depends on your consistency level and node availability. If you're reading at ONE and your nodes are all up, it'll go to the node responsible for the primary range. If that node is down, it'll find the secondary. If you're reading at QUORUM, it will read from both, as QUORUM of 2 is 2. – Aaron May 23 '19 at 14:54
But it cannot round-robin requests to primary node and secondary one. – voipp May 23 '19 at 14:56
If you're using the Token Aware load balancing policy, no. You have to enable replica shuffling for that. Otherwise, using one of the Round Robin policies should do that intrinsically. – Aaron May 23 '19 at 14:58

score 0 · Answer 2 · answered May 22 '19 at 17:15

So yes cassandra does have secondary indexes and aaron's explaination does a great job of explaining why.

You see many people trying to solve this issue by writing their data to multiple tables. This is done so they can be sure that the data they need to answer the query that would traditionally rely on a secondary index is on the same node.

Some of the recent iterations of cassandra have this 'built in' via materialized views. I've not really used them since 3.0.11 but they are promising. The problems i had at the time were primarily adding them to tables with existing data and they had a suprisingly large amount of overhead on write (increased latency).

Why Cassandra doesn't have secondary index?

2 Answers2