36

I was reading this discussion in another post where this question was raised by someone else. Before reading the discussion, I always thought SQL Server (and other DBMS) keep a global count of rows for each table somewhere in the metadata, but the discussion seems to say it isn't so. Why? Count(*) (without any filtering) being such a common operation would get huge boost if it is O(1). Even not considering COUNT(*), the total number of rows in a table is such a fundamental piece of information. Why don't they keep a note of it?

In addition, why do we need to "load" entire rows (as indicated in the post I linked) just to count them? Shouldn't indexes or PKs etc. be sufficient to count them?

dotNET
  • 33,414
  • 24
  • 162
  • 251
  • 4
    You would have to ask the developers of SQL Server. As far as I know, no database implements `COUNT(*)` as a constant time operation. – Gordon Linoff May 24 '17 at 12:09
  • 15
    My immediate thought is "locking and transactions" - any central piece of data has to be kept up to date as rows are added and removed, and that can be in multiple simultaneous threads, including uncommitted transactions. That makes it less trivial than it seems at first. Also, count(*) of the whole table (i.e. with no WHERE clause) is not that common as a production query, more as a debugging thing, I'd have thought, so maybe not worth effort to optimise for? – IMSoP May 24 '17 at 12:12
  • 1
    As a matter of fact, they do keep a note of it -- it's `rows` in `sys.partitions`. Using this to implement a faster `COUNT`, as well as finding out the limitations of this approach, is left as an exercise to the reader. – Jeroen Mostert May 24 '17 at 12:17
  • 1
    There are [several ways to count records in a table](https://www.brentozar.com/archive/2014/02/count-number-rows-table-sql-server/) in SQL Server. – alroc May 24 '17 at 12:18
  • 6
    @JeroenMostert - just to be clear, the rows number in sys.partitions is an estimate; it says so in the documentation (https://learn.microsoft.com/en-us/sql/relational-databases/system-catalog-views/sys-partitions-transact-sql). And this makes a lot of sense for the "locking and transactions" argument put forth by IMSoP. – Ben Thul May 24 '17 at 14:30
  • 2
    @BenThul: yes -- in fact, it can even get *negative* if you're unlucky. Even so, it tends to be accurate enough if you only need an indication of whether there are 10, 1000, or 1 billion rows in a table, and in the latter case, it will certainly beat `COUNT(*)` for determining this. But you don't want to use it if you're, say, producing invoices based on article counts. – Jeroen Mostert May 24 '17 at 14:33
  • @GordonLinoff `COUNT(*)` is optimised when using [the MyISAM storage engine on MySQL](https://stackoverflow.com/questions/6890960/is-select-count-expensive). You could of course make the reasonable argument that MyISAM isn't a database. – Philip Kendall May 25 '17 at 08:31

2 Answers2

56

No, COUNT(*) is not a constant time operation. A COUNT(*) must return a count of rows that conform to the current scan predicate (ie. WHERE clause), so that alone would make the return of a metadata property invalid. But even if you have no predicates, the COUNT still has to satisfy the current transaction isolation semantics, ie. return the count of rows visible (eg. committed). So COUNT must, and will, in SQL Server, actually scan and count the rows. Some systems allow return of faster 'estimate' counts.

Also, as a side comment, relying on rows in sys.partitions is unreliable. After all, if this count would be guaranteed accurate then we would not need DBCC UPDATEUSAGE(...) WITH COUNT_ROWS. There are several scenarios that historically would cause this counter to drift apart from reality (mostly minimally logged insert rollbacks), all I know of are fixed, but that still leaves the problems of 1) upgraded tables from earlier versions that had the bugs and 2) other, not yet discovered, bugs.

In addition, why do we need to "load" entire rows (as indicated in the post I linked) just to count them? Shouldn't indexes or PKs etc. be sufficient to count them?

This is not 100% true. There are at least 2 scenarios that do no 'load entire rows':

  • narrow rowstore indexes load just the 'index' row, which may be much smaller
  • columnstore data loads just the relevant column segments

And most of what I say above do not apply for Hekaton tables.

chwarr
  • 6,777
  • 1
  • 30
  • 57
Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
21

why do we need to "load" entire rows

We don't. SQL Server will tend to use the smallest index that it can use that can satisfy the query.

Count(*) (without any filtering) being such a common operation

I think you over-estimate its prevalence. I can't remember the last time that I cared about the total number of rows in a single table versus a more filtered view or a count across a more complex joined operation.

It would be an exceptionally narrow optimization that could only benefit a single style of query, and as I say, I think you've overestimated how often it occurs.

Damien_The_Unbeliever
  • 234,701
  • 27
  • 340
  • 448
  • 3
    One scenario where getting a count over all rows is common and worth optimizing is when a table is used to implement a queue, and it needs to be monitored for size frequently. There are many other issues and caveats with using tables as queues, of course, but I just wanted to point it out. Note that even when using Service Broker, which is specifically touted as a good way to implement queues in SQL Server, a `COUNT(*)` over a queue slows down with the number of rows just as it would for a regular table, so using one of the fast approximations of `COUNT(*)` is very worthwhile there. – Jeroen Mostert May 24 '17 at 12:29
  • Thanks for the valuable input. Thinking about it again I realized that the occurrence/need of unfiltered `COUNT(*)` is indeed infrequent. – dotNET May 24 '17 at 13:33
  • @JeroenMostert Why do you need to monitor a queue for its size frequently? Wouldn't it be sufficient to simply check if there's more than 0 records (i.e. fetch one)? – mustaccio May 24 '17 at 14:22
  • 1
    @mustaccio: if it was realistic to continuously have an empty queue, you wouldn't need a queue in the first place. For some applications, fetching the oldest row and seeing if it's not too old may suffice, and that would be an index seek. For others, you may wish to check if the queue is not "too large", which can be done faster than a plain `COUNT` by including a `TOP`. You still won't get around wanting to have an actual count when the queue is large, which is exactly the worst case for `COUNT(*)`. – Jeroen Mostert May 24 '17 at 14:28
  • Unfiltered `COUNT(*)` is more frequent in interactive use than in applications -- if I'm running a script that's loading a table, I might use it periodically in another window to monitor progress. – Barmar May 24 '17 at 19:42