
This is a super basic question but it's actually been bugging me for days. Is there a good way to obtain the equivalent of a COUNT(*) of a given table in Cassandra?

I will be moving several hundreds of millions of rows into C* for some load testing and I'd like to at least get a row count on some sample ETL jobs before I move massive amounts of data over the network.

The best idea I have is to basically loop over each row with Python and increment a counter. Is there a better way to determine (or even estimate) the row count of a C* table? I've also poked around DataStax OpsCenter to see if I can find a row count there; if it's possible, I don't see how.
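For what it's worth, the loop-and-count idea can be sketched with the DataStax cassandra-driver package. The keyspace, table, and column names here are hypothetical, and the counting logic is split out so it works over any iterable of rows:

```python
# Sketch of the loop-and-count approach, assuming the DataStax
# cassandra-driver package and hypothetical keyspace/table/column names.
def count_rows(rows):
    """Count items in any iterable without materializing it in memory."""
    total = 0
    for _ in rows:
        total += 1
    return total

if __name__ == "__main__":
    # Requires a reachable cluster; fetch_size makes the driver page
    # through results instead of pulling everything at once.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")
    stmt = SimpleStatement("SELECT my_key FROM my_table", fetch_size=5000)
    print(count_rows(session.execute(stmt)))
```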

Anyone else needed to get a count(*) of a table in C*? If so, how'd you go about doing it?

Evan Volgas

11 Answers


Yes, you can use COUNT(*). Here's the documentation.

A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users table:

SELECT COUNT(*) FROM users;
catpaws
  • I'm curious to hear back how long your count takes. Maybe you can report back. :) – Don Branson Oct 29 '14 at 00:31
  • It was your turn. We all get a turn at being the idiot. Mine's coming up tomorrow. Can't wait. – Don Branson Oct 29 '14 at 01:24
  • In terms of benchmarks, right now I'm running a 5-node cluster with SSDs and 32 GB of RAM. For a million rows, it's taking about a minute and a half to return that COUNT(*). I've got a lot of tweaking and fine-tuning to do, but in terms of where it is right now, "straight out of the box" if you will, it's definitely a bit on the slow side. – Evan Volgas Oct 29 '14 at 14:58
  • So that seems to point back to the post-relational mantra: design for your queries. Create a table that keeps a counter, and bump the count whenever you add a row. Then the query will come back quickly; I'd expect it to be on the order of a few milliseconds. – Don Branson Oct 29 '14 at 15:10
  • If I needed to execute COUNT(*)s against my tables often at all, I'd certainly agree with you. This is just sanity checking and control totals: "I sent it a million records... did all of them arrive?" kind of thing. – Evan Volgas Oct 29 '14 at 15:30
  • Generally speaking, running aggregate queries in Cassandra is not good practice even though it is technically supported. Also, if you end up with millions of rows, this query will likely time out. Using some of the nodetool commands below is generally a better approach. – ammills01 Aug 11 '17 at 14:25
  • I don't know of one. – catpaws Sep 17 '19 at 23:31
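Don Branson's counter-table suggestion from the comments can be sketched as CQL. The table and key names here are hypothetical, and the statements are shown as Python strings ready to pass to a driver session:

```python
# Hypothetical counter-table design: maintain the row count yourself and
# bump it on every insert, so reading the count is a single-partition read.
create_stmt = """
CREATE TABLE IF NOT EXISTS row_counts (
    table_name text PRIMARY KEY,
    row_count  counter
)"""

# Cassandra counters support atomic increments; run this alongside each insert.
bump_stmt = "UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'users'"

# Reading the count back is then a cheap single-row query.
read_stmt = "SELECT row_count FROM row_counts WHERE table_name = 'users'"
```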

You can use COPY to avoid the Cassandra timeout that usually happens with COUNT(*):

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'
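What the sed pipeline does: cqlsh's COPY ... TO prints a summary after exporting, and the pipeline keeps the first token of the fifth output line, which is the exported-row count. A rough Python equivalent (the sample output below is illustrative, not verbatim cqlsh output):

```python
# Rough equivalent of `sed -n 5p | sed 's/ .*//'`: take the 5th line of the
# cqlsh COPY output and keep its first whitespace-delimited token.
def extract_count(cqlsh_output: str) -> int:
    fifth_line = cqlsh_output.splitlines()[4]  # sed -n 5p
    return int(fifth_line.split()[0])          # sed 's/ .*//'

# Illustrative sample only; real cqlsh wording and line numbers can vary.
sample = (
    "Using 1 child processes\n"
    "\n"
    "Starting copy of ks.tbl with columns [id].\n"
    "Processed: 1000000 rows; Rate: 50000 rows/s\n"
    "1000000 rows exported to 1 files in 20.000 seconds.\n"
)
print(extract_count(sample))  # 1000000
```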

Shubham

nodetool tablestats can be pretty handy for quickly getting row estimates (and other table stats).

nodetool tablestats <keyspace.table> for a specific table
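If you want the estimate programmatically, you can grep it out of the tablestats output. Note the field is "Number of keys (estimate)" in some Cassandra versions and "Number of partitions (estimate)" in others, and either way it counts partitions, not CQL rows. A sketch with an illustrative sample:

```python
import re

# Pull the partition-count estimate out of `nodetool tablestats` output.
# The field name varies by Cassandra version, and it estimates partitions,
# not CQL rows.
def parse_estimate(tablestats_output: str) -> int:
    match = re.search(r"Number of (?:keys|partitions) \(estimate\):\s*(\d+)",
                      tablestats_output)
    if match is None:
        raise ValueError("no estimate line found")
    return int(match.group(1))

# Illustrative sample only, not verbatim nodetool output.
sample = "Table: users\n\tNumber of keys (estimate): 20000\n"
print(parse_estimate(sample))  # 20000
```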

Lucian Thorr
  • I don't see an estimated count using tablestats. COPY seems a good option, but for larger tables (with less free storage left on disk) it's a bit difficult to manage space. Any other way? – Anil Kapoor Oct 14 '19 at 13:18
  • I believe originally I thought "Number of keys" was sufficient but I see now that it's referring to partition keys. You could possibly try using Memtable Cell Count if you know the column size. https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsTablestats.html – Lucian Thorr Oct 15 '19 at 14:16
  • I don't see row estimates under `nodetool tablestats`. – IceTea Dec 03 '21 at 17:40

You can use dsbulk count to retrieve the total row count of a table. I struggled with read timeouts using all of the commands mentioned above, and was finally able to obtain the count using the command below.

e.g.,

dsbulk count -k <keyspace_name> -t <table_name>

More information about dsbulk can be found here

Pritish Shah

You can also get some estimates from nodetool cfhistograms if you don't need an exact count (these values are estimates).

You can also use spark if you're running DSE.

phact
  • How can you use Spark to make the count faster ? For example I have about billion records and I am using spark to count the number of rows. It took me 1hr 30 mins to complete it. Here are more details http://stackoverflow.com/questions/40778936/apache-spark-sql-is-taking-forever-to-count-billion-rows-from-cassandra/40781384?noredirect=1#comment68807487_40781384 – user1870400 Nov 24 '16 at 18:38
  • There is no `cfhistograms` option now and the `tablehistograms` does not seem to offer much useful info in terms of number of rows. – Alexis Wilke Dec 30 '18 at 02:12
  • cfhistograms and tablehistograms are the same thing, like I said you get some general estimates, not a strict count. – phact Dec 31 '18 at 01:37
This raises the server-side read timeout (in milliseconds, via nodetool) and the client-side request timeout (in seconds, via cqlsh) before running the count:

$ nodetool settimeout read 360000
cqlsh -e "SELECT COUNT(*) FROM table;" --request-timeout=3600
Aleksandr M.
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Tiago Martins Peres Oct 26 '18 at 09:52

I've been working with Elasticsearch, and this can be an answer to this problem, assuming you are willing to use Elassandra instead of plain Cassandra.

The search system maintains many statistics and within seconds of the last updates it should have a good idea of how many rows you have in a table.

Here is a Match All Query request that gives you the information:

curl -XGET \
     -H 'Content-Type: application/json' \
     "http://127.0.0.1:9200/<search-keyspace>/_search/?pretty=true" \
     -d '{ "size": 1, "query": { "match_all": {} } }'

Where the <search-keyspace> is a keyspace that Elassandra creates. It generally is named something like <keyspace>_<table>, so if you have a keyspace named foo and a table named bar in that keyspace, the URL will use .../foo_bar/.... If you want to get the total number of rows in all your tables, then just use /_search/.

The output is a JSON which looks like this:

{
  "took" : 124,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 519659,                <-- this is your number
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo_bar",
        "_type" : "content",
        "_id" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284",
        "_score" : 1.0,
        "_source" : {
          "date" : "2018-12-29T00:06:27.710Z",
          "key" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284"
        }
      }
    ]
  }
}
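The highlighted hits.total can be pulled out programmatically. A small sketch; note that in Elasticsearch 7+ hits.total is an object with a value field rather than a bare number:

```python
import json

# Extract the row-count estimate from a match-all search response.
def total_hits(response_text: str) -> int:
    hits_total = json.loads(response_text)["hits"]["total"]
    if isinstance(hits_total, dict):   # Elasticsearch 7+ shape
        return hits_total["value"]
    return hits_total                  # older bare-number shape

response = '{"hits": {"total": 519659, "max_score": 1.0, "hits": []}}'
print(total_hits(response))  # 519659
```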

And in terms of speed, this takes milliseconds, regardless of the number of rows. I have tables with many millions of rows and it works like a charm. No need to wait hours or anything like that.

Keep in mind that Elassandra is still a distributed system, written to in parallel by many machines. The counters will change quickly if you have many updates all the time, so the numbers you get from Elasticsearch are correct only if you prevent further updates for long enough for the counters to settle. Otherwise it's always going to be an approximate result.

Alexis Wilke

For COUNT(*) on big tables, you can use Presto on top of Cassandra. I have tested it and it works well:

select count(*) from table1

Hareesha

For those using the C# LINQ component adapter, you can use:

var t = new Table<T>(session);
var count = t.Count().Execute();
Crunchy234

Consider using ALLOW FILTERING with column constraints and then summing the values.

For example:

SELECT count(*)
FROM my_table
WHERE datetime_id >= '2020-09-16' ALLOW FILTERING;

SELECT count(*)
FROM my_table
WHERE datetime_id < '2020-09-16' ALLOW FILTERING;
prost0

nodetool cfstats | grep -A 1000 KEYSPACE

Replace KEYSPACE with your keyspace name to get details of all tables in that keyspace.

Vaibhav