Cassandra CQL Select count with LIMIT

Question

I created a simple tabe:

CREATE TABLE test (
  "type" varchar,
  "value" varchar,
  PRIMARY KEY(type,value)
);

I inserted 5 rows into it:

INSERT INTO test(type,value) VALUES('test','tag1')
INSERT INTO test(type,value) VALUES('test','tag2')
INSERT INTO test(type,value) VALUES('test','tag3')
INSERT INTO test(type,value) VALUES('test','tag4')
INSERT INTO test(type,value) VALUES('test','tag5')

I ran SELECT * from test LIMIT 3 and it works as expected.

 type | value
------+------
 test |  tag1
 test |  tag2
 test |  tag3

When I ran SELECT COUNT(*) from test LIMIT 3, it produces:

 count
-------
     5

Shouldn't it say 3?

The Datastax documentation seems to suggest that specifying a LIMIT will overwrite the default of 10,000. Why does it not work in this case? If it matters, I'm on Cassandra 2.2.5 and ran all the queries through cqlsh.

Update Both the Java driver and CQLSH have been tested to show that LIMIT indeed does not work as prescribed in the documentation. If there are any Datastax employees reading, your input would be greatly appreciated.

Ralf · Answer 1 · 2016-04-25T10:18:45.757

10

My spontaneous response to this was that a row count always only returns one row in its result set, stating the number of rows found. So any LIMIT greater than 1 would not have an effect.

But as @light correctly pointed out, the documentation states that the LIMIT should apply to a count(*). And with good reason too. According to this blog post Cassandra cannot source any meta data to come up with the number or rows, but has to inspect every partition (on every node) to get to the number. It thus is a very expensive operation.

However, contrary to the documentation, when querying C* 2.2.4 with cqlsh or with the Java driver (v3.0.0) the LIMIT clause has no effect on the reported number of rows. Neither has the default limit of cqlsh of 10'000 rows. Nor has a LIMIT greater than 10'000 if there are more than 10'000.

The documentation and implementation seem to be out of sync. Though which one is incorrect I cannot say.

EDIT

The ticket referenced by @Abhishek Anand concludes that the documentation is wrong. Not the behavior. So specifying a limit of 1 will count all your rows. And that is the desired behavior.

edited Apr 25 '16 at 10:18

answered Mar 23 '16 at 10:07

Ralf

6,735
3
16
32

If `SELECT COUNT` always returns 1 row, a `LIMIT` clause is useless in a `SELECT COUNT` query, isn't it? Why would the Datastax documentation specifically gave examples like `SELECT COUNT(*) FROM big_table LIMIT 50000;`? – light Mar 23 '16 at 10:12
@light, um, you are right. The documentation suggests that you should get 3 as a result of the count when combined with the LIMIT clause. This [article](http://www.planetcassandra.org/blog/counting-key-in-cassandra/) explains why applying the limit to a count is a good thing to have. I'll test on my instance (C* v 2.2.4). What is your C* version? – Ralf Mar 23 '16 at 10:24
I'm on Cassandra v2.2.5 – light Mar 23 '16 at 10:36
@light, that's a bummer. ;-) I was hoping you were on v3.x. The storage engine in that version was completely reworked. The potential presence of addtl. meta-data might have allowed them to remove the constraint of applying the LIMIT to count() queries. – Ralf Mar 23 '16 at 10:56
I'd probably upgrade to v3 when it becomes more stable and well-supported, i.e. the drivers for various languages get updated. Hopefully, along with even better documentation, so we won't have to scratch our heads at such "anomalies" :) – light Mar 23 '16 at 11:04
Thanks for the update. Perhaps we need to await clarification from an official source. In any case, the premise in your original answer that `LIMIT` does not apply to the count value in `SELECT COUNT` appears to be erroneous. For the benefit of future readers, I suggest editing that out. – light Mar 23 '16 at 12:15

score 10 · Answer 2 · answered Mar 28 '16 at 12:56

10

This is a Bug in cassandra and version 2.2.x is affected by it.

https://issues.apache.org/jira/browse/CASSANDRA-8216

They have marked it as fixed, but clearly this has propagated to version beyond fixed version.

Anyways, light, Your assumption/thinking is completely correct. Limit keyword has to be applied on cassandra's count(*), and it works as it should in the versions I am working on 3.2.4 and 2.1.x

answered Mar 28 '16 at 12:56

Abhishek Anand

1,940
14
27

I have cassandra [cqlsh 5.0.1 | Cassandra 3.6 | CQL spec 3.4.2 | Native protocol v4]. When I used this query - SELECT COUNT(*) from kwhhourlyconsumption LIMIT 3; I am getting wrong output that is 87 with warning- "Aggregation query used without partition key". Please let me know if I am missing anything. @Abhishek Anand – Tejas Ratunawar Oct 02 '19 at 06:32

score 2 · Answer 3 · answered Mar 23 '16 at 10:08

2

The limit clause is used to limitate the number of rows in the result. The count(*) return only a single row with the count of (in this case) total rows.

the "limit 3" does not affect the number of occurrences analyzed from the count(*), if you want this, you most be use a "where"

answered Mar 23 '16 at 10:08

mariettièllo

66
5

Thanks for the answer. If `SELECT COUNT` always returns 1 row, a `LIMIT` clause is useless in a `SELECT COUNT` query, isn't it? Why would the Datastax documentation specifically gave examples like `SELECT COUNT(*) FROM big_table LIMIT 50000;`? – light Mar 23 '16 at 10:12
you refer to this documentation? --> "SELECT COUNT() FROM big_table LIMIT 50000; SELECT COUNT() FROM big_table LIMIT 200000; The output of these statements if you had 105,291 rows in the database would be: 50000 and 105291" I think they have confused – mariettièllo Mar 23 '16 at 10:32
so the Datastax documentation is erroneous? – light Mar 23 '16 at 10:36
The answer from this question also suggests that the use of `LIMIT` applies to `SELECT COUNT`: http://stackoverflow.com/questions/8795923/wrong-count-with-cassandra-cql?rq=1 – light Mar 23 '16 at 10:38
I was not aware, but the examples show that you can only use this function to increase the limit, not to decrease. Perhaps that is why in your example does not work – mariettièllo Mar 23 '16 at 10:46
Perhaps, but it's not exactly clear in the documentation, or anywhere else I can find. In any case, the premise that `LIMIT` does not work with `SELECT COUNT` does not seem to be correct, which your answer seems to imply. – light Mar 23 '16 at 10:50

Cassandra CQL Select count with LIMIT

3 Answers3

Linked