
This is a super basic question but it's actually been bugging me for days. Is there a good way to obtain the equivalent of a COUNT(*) of a given table in Cassandra?

I will be moving several hundreds of millions of rows into C* for some load testing and I'd like to at least get a row count on some sample ETL jobs before I move massive amounts of data over the network.

The best idea I have is to basically loop over each row with Python and increment a counter. Is there a better way to determine (or even estimate) the row count of a C* table? I've also poked around DataStax OpsCenter to see if I can find a row count there; if it's possible, I don't see how.
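For what it's worth, the loop-and-count idea can be sketched with the DataStax cassandra-driver package. The keyspace, table, and column names here are hypothetical, and the counting logic is split out so it works over any iterable of rows:

```python
# Sketch of the loop-and-count approach, assuming the DataStax
# cassandra-driver package and hypothetical keyspace/table/column names.
def count_rows(rows):
    """Count items in any iterable without materializing it in memory."""
    total = 0
    for _ in rows:
        total += 1
    return total

if __name__ == "__main__":
    # Requires a reachable cluster; fetch_size makes the driver page
    # through results instead of pulling everything at once.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
    stmt = SimpleStatement("SELECT my_key FROM my_table", fetch_size=5000)
    print(count_rows(session.execute(stmt)))
```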

Anyone else needed to get a count(*) of a table in C*? If so, how'd you go about doing it?

Evan Volgas

11 Answers


Yes, you can use COUNT(*). Here's the documentation.

A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users table:

SELECT COUNT(*) FROM users;
catpaws
  • I'm curious to hear back how long your count takes. Maybe you can report back. :) – Don Branson Oct 29 '14 at 00:31
  • It was your turn. We all get a turn at being the idiot. Mine's coming up tomorrow. Can't wait. – Don Branson Oct 29 '14 at 01:24
  • In terms of benchmarks, right now I'm running a 5-node cluster with SSDs and 32 GB of RAM. For a million rows, it's taking about a minute and a half to return that COUNT(*). I've got a lot of tweaking and fine-tuning to do, but in terms of where it is right now, "straight out of the box" if you will, it's definitely a bit on the slow side. – Evan Volgas Oct 29 '14 at 14:58
  • So that seems to point back to the post-relational mantra: design for your queries. Create a table that keeps a counter, and bump the count whenever you add a row. Then the query will come back quickly; I'd expect it to be on the order of a few milliseconds. – Don Branson Oct 29 '14 at 15:10
  • If I needed to execute COUNT(*)s against my tables often at all, I'd certainly agree with you. This is just sanity checking and control totals: "I sent it a million records... did all of them arrive?" kind of thing. – Evan Volgas Oct 29 '14 at 15:30
  • Generally speaking, running aggregate queries in Cassandra is not good practice even though it is technically supported. Also, if you end up with millions of rows, this query will likely time out. Using some of the nodetool commands below is generally a better approach. – ammills01 Aug 11 '17 at 14:25
  • I don't know of one. – catpaws Sep 17 '19 at 23:31
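Don Branson's counter-table suggestion from the comments can be sketched as CQL. The table and key names here are hypothetical, and the statements are shown as Python strings ready to pass to a driver session:

```python
# Hypothetical counter-table design: maintain the row count yourself and
# bump it on every insert, so reading the count is a single-partition read.
create_stmt = """
CREATE TABLE IF NOT EXISTS row_counts (
    table_name text PRIMARY KEY,
    row_count  counter
)"""

# Cassandra counters support atomic increments; run this alongside each insert.
bump_stmt = "UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'users'"

# Reading the count back is then a cheap single-row query.
read_stmt = "SELECT row_count FROM row_counts WHERE table_name = 'users'"
```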

You can use COPY to avoid the Cassandra timeout that usually happens with COUNT(*):

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'
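What the sed pipeline does: cqlsh's COPY ... TO prints a summary after exporting, and the pipeline keeps the first token of the fifth output line, which is the exported-row count. A rough Python equivalent (the sample output below is illustrative, not verbatim cqlsh output):

```python
# Rough equivalent of `sed -n 5p | sed 's/ .*//'`: take the 5th line of the
# cqlsh COPY output and keep its first whitespace-delimited token.
def extract_count(cqlsh_output: str) -> int:
    fifth_line = cqlsh_output.splitlines()[4]  # sed -n 5p
    return int(fifth_line.split()[0])          # sed 's/ .*//'

# Illustrative sample only; real cqlsh wording and line numbers can vary.
sample = (
    "Using 1 child processes\n"
    "\n"
    "Starting copy of ks.tbl with columns [id].\n"
    "Processed: 1000000 rows; Rate: 50000 rows/s\n"
    "1000000 rows exported to 1 files in 20.000 seconds.\n"
)
print(extract_count(sample))  # 1000000
```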

Shubham

nodetool tablestats can be pretty handy for quickly getting row estimates (and other table stats).

nodetool tablestats <keyspace.table> for a specific table
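If you want the estimate programmatically, you can grep it out of the tablestats output. Note the field is "Number of keys (estimate)" in some Cassandra versions and "Number of partitions (estimate)" in others, and either way it counts partitions, not CQL rows. A sketch with an illustrative sample:

```python
import re

# Pull the partition-count estimate out of `nodetool tablestats` output.
# The field name varies by Cassandra version, and it estimates partitions,
# not CQL rows.
def parse_estimate(tablestats_output: str) -> int:
    match = re.search(r"Number of (?:keys|partitions) \(estimate\):\s*(\d+)",
                      tablestats_output)
    if match is None:
        raise ValueError("no estimate line found")
    return int(match.group(1))

# Illustrative sample only, not verbatim nodetool output.
sample = "Table: users\n\tNumber of keys (estimate): 20000\n"
print(parse_estimate(sample))  # 20000
```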

Lucian Thorr
  • I don't see an estimated count using tablestats. COPY seems a good option, but for larger tables (with less free storage left on disk) it's a bit difficult to manage space. Any other way? – Anil Kapoor Oct 14 '19 at 13:18
  • I believe originally I thought "Number of keys" was sufficient but I see now that it's referring to partition keys. You could possibly try using Memtable Cell Count if you know the column size. https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsTablestats.html – Lucian Thorr Oct 15 '19 at 14:16
  • I don't see row estimates under `nodetool tablestats`. – IceTea Dec 03 '21 at 17:40

You can use dsbulk count to retrieve the total row count of a table. I struggled with read timeouts using all of the commands mentioned above, and was finally able to obtain the count using the command below.

e.g.,

dsbulk count -k <keyspace_name> -t <table_name>

More information about dsbulk can be found here

Pritish Shah

You can also get some estimates from nodetool cfhistograms if you don't need an exact count (these values are estimates).

You can also use spark if you're running DSE.

phact
  • How can you use Spark to make the count faster ? For example I have about billion records and I am using spark to count the number of rows. It took me 1hr 30 mins to complete it. Here are more details http://stackoverflow.com/questions/40778936/apache-spark-sql-is-taking-forever-to-count-billion-rows-from-cassandra/40781384?noredirect=1#comment68807487_40781384 – user1870400 Nov 24 '16 at 18:38
  • There is no `cfhistograms` option now and the `tablehistograms` does not seem to offer much useful info in terms of number of rows. – Alexis Wilke Dec 30 '18 at 02:12
  • cfhistograms and tablehistograms are the same thing, like I said you get some general estimates, not a strict count. – phact Dec 31 '18 at 01:37
This raises the server-side read timeout (in milliseconds, via nodetool) and the client-side request timeout (in seconds, via cqlsh) before running the count:

$ nodetool settimeout read 360000
cqlsh -e "SELECT COUNT(*) FROM table;" --request-timeout=3600
Aleksandr M.
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Tiago Martins Peres Oct 26 '18 at 09:52

I've been working with Elasticsearch, and this can be an answer to this problem, assuming you are willing to use Elassandra instead of plain Cassandra.

The search system maintains many statistics and within seconds of the last updates it should have a good idea of how many rows you have in a table.

Here is a Match All Query request that gives you the information:

curl -XGET \
     -H 'Content-Type: application/json' \
     "http://127.0.0.1:9200/<search-keyspace>/_search/?pretty=true" \
     -d '{ "size": 1, "query": { "match_all": {} } }'

Where the <search-keyspace> is a keyspace that Elassandra creates. It generally is named something like <keyspace>_<table>, so if you have a keyspace named foo and a table named bar in that keyspace, the URL will use .../foo_bar/.... If you want to get the total number of rows in all your tables, then just use /_search/.

The output is a JSON which looks like this:

{
  "took" : 124,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 519659,                <-- this is your number
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo_bar",
        "_type" : "content",
        "_id" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284",
        "_score" : 1.0,
        "_source" : {
          "date" : "2018-12-29T00:06:27.710Z",
          "key" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284"
        }
      }
    ]
  }
}
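The highlighted hits.total can be pulled out programmatically. A small sketch; note that in Elasticsearch 7+ hits.total is an object with a value field rather than a bare number:

```python
import json

# Extract the row-count estimate from a match-all search response.
def total_hits(response_text: str) -> int:
    hits_total = json.loads(response_text)["hits"]["total"]
    if isinstance(hits_total, dict):   # Elasticsearch 7+ shape
        return hits_total["value"]
    return hits_total                  # older bare-number shape

response = '{"hits": {"total": 519659, "max_score": 1.0, "hits": []}}'
print(total_hits(response))  # 519659
```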

And in terms of speed, this takes milliseconds, regardless of the number of rows. I have tables with many millions of rows and it works like a charm. No need to wait hours or anything like that.

Keep in mind that Elassandra is still a distributed system, written to in parallel by many machines. The counters will change quickly if you have many updates all the time, so the numbers you get from Elasticsearch are correct only if you prevent further updates for long enough for the counters to settle. Otherwise it's always going to be an approximate result.

Alexis Wilke

For COUNT(*) on big tables, you can use Presto on top of Cassandra. I have tested it and it works well:

select count(*) from table1

Hareesha

For those using the C# LINQ component adapter, you can use:

var t = new Table<T>(session);
var count = t.Count().Execute();
Crunchy234

Consider using ALLOW FILTERING with column constraints and then summing the values.

For example:

SELECT count(*)
FROM my_table
WHERE datetime_id >= '2020-09-16' ALLOW FILTERING;

SELECT count(*)
FROM my_table
WHERE datetime_id < '2020-09-16' ALLOW FILTERING;
prost0

nodetool cfstats | grep -A 1000 KEYSPACE

Replace KEYSPACE with your keyspace name to get details of all tables in that keyspace.

Vaibhav