0

Yes I know this question is similar to this thread: COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better?, but this is a bit different.

My senior said that getting result from count(PrimaryKey), assuming that PrimaryKey cannot be NULL, is somehow faster than doing a normal count(*). Is this true?

If this is true, is it true for all RDBMS? Please refer to (semi-)official document if it's possible.

Community
  • 1
  • 1
zfm
  • 1,906
  • 2
  • 17
  • 28
  • @Ian: did you actually read my first statement about the possibility of it being similar? – zfm Sep 14 '12 at 03:20
  • I did, and then forgot it along the way somewhere -- sorry, I'll try to make it up to you with some research :) – Ian Clelland Sep 14 '12 at 05:29

2 Answers2

2

No. This appears to be a persistent misconception, based on a confusion between the syntax

SELECT * FROM ...

and

SELECT COUNT(*) FROM ...

In the first case, * refers to all columns, and returning those certainly requires more resources than returning a single column. In the second case, COUNT(*) is simply shorthand for "count all rows". The mistaken belief is that COUNT(*) somehow instructs the database engine to examine all columns in all rows, whereas COUNT(<pk_field>) would only have to look at one column.

There are a number of other comments on SO here that reference the SQL-92 standard, which explicitly states that COUNT(*) should just refer to the cardinality of the table, so, at least in theory, database engines should be able to recognize and optimize that.

As far as I can tell, in both cases, most database engines (Postgres, Oracle, MySQL InnoDB) will just perform an index scan to count the number of rows. If you specify the PK, then that index will be used; if you just use COUNT(*), then the query planner will pick an index which spans the entire table*, but the performance should be identical.

The only exception to this that I can find is MySQL with MyISAM tables -- those tables cache the number of rows, so COUNT(*) is very fast. However, the query planner also recognizes COUNT(<field>), where <field> is any non-null column, as a request for the full table size, and uses the cache in that case as well. (source) So again, no difference in performance.

* Theoretically, if you had no such indexes, then COUNT(*) would be very slow, but in that case, COUNT(<pk>) would be impossible by definition

Community
  • 1
  • 1
Ian Clelland
  • 43,011
  • 8
  • 86
  • 87
0

It doesn't matter for several reasons. First, both notations -- COUNT(1) and COUNT(*) -- are wrong syntax. Consider the same question about the SUM aggregate. Oh, SUM(*) doesn't make any sense; why? Because, summation is iterative execution of the assignment

for( int columnValue : columnList )
   currentSum = currentSum + columnValue;

whereas for COUNT aggregate it looks like this

for( Tuple t : tupleList )
   currentSum = currentSum + 1;

Therefore, the COUNT aggregate shouldn't have any parameters at all!

Then, there are all kind of syntactic quirks, such as count distinct. This simply demonstrates incompetence of SQL designers who tried to squeeze two consecutive actions (selecting distinct tuples, then aggregating) into one operation.

The second reason why it doesn't matter is that in practice you'll encounter myriads of poorly performing queries, and COUNT(1) vs COUNT(*) is never a bottleneck.

Tegiri Nenashi
  • 3,066
  • 2
  • 19
  • 20
  • `COUNT` needs an optional parameter -- if given, it will not count rows where the expression evaluates to `NULL` -- See http://www.postgresql.org/docs/9.2/static/functions-aggregate.html – Ian Clelland Sep 13 '12 at 18:58
  • So, can't you filter out nulls first, then count? In other words, you can explain what composition of two operations only once, as opposed to introducing awkward syntax for each and every usage scenario. – Tegiri Nenashi Sep 13 '12 at 19:12
  • All SQL aggregate functions skip null values: `SUM`, `COUNT`, `MIN`, `MAX`, `AVERAGE`, etc. `COUNT(*)` is the odd one out. If you had to filter first, you wouldn't be able to do something like `SELECT COUNT(column_a), COUNT(column_b) FROM table` to count non-null values in two distinct columns – Ian Clelland Sep 13 '12 at 19:23
  • Sure you can: `select sum(case when mgr is null then 0 else 1 end) cmgr, sum(case when comm is null then 0 else 1 end) csal from emp` – Tegiri Nenashi Sep 13 '12 at 19:54
  • well actually I asked about count(PK), not count(1) – zfm Sep 14 '12 at 03:30
  • count(field) is wrong syntax for the same reason. Again, proper designed count function (which shouldn't have any arguments) wouldn't cause silly discussions like this. – Tegiri Nenashi Sep 14 '12 at 20:08