1

I've run into a subtlety around count(*) and join, and a hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say 10AM, fine, we stil get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:

  • The count is 1 if you count the "grid" series, and 0 if you count the data table.
  • This only has to do with count, and no other aggregate.

The code below sets up some sample data to show what I'm talking about:

DROP TABLE IF EXISTS analytics.measurement_table CASCADE;

CREATE TABLE IF NOT EXISTS analytics.measurement_table (
    hour smallint NOT NULL DEFAULT NULL,
    measurement smallint NOT NULL DEFAULT NULL
);


INSERT INTO measurement_table (hour, measurement)
     VALUES ( 0, 1),
            ( 1, 1),  ( 1, 1),
            (10, 2),  (10, 3), (10, 5); 

Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.

Hour Count sum
 0    1     1
 1    2     2
 2    0     0
 3    0     0
 4    0     0
 5    0     0
 6    0     0
 7    0     0
 8    0     0
 9    0     0
10    3    10
11    0     0
12    0     0

This works correctly:

WITH hour_series AS (
    select * from generate_series (0,12) AS hour
    ) 

    SELECT hour_series.hour,
           count(measurement_table.hour) AS frequency,
           COALESCE(sum(measurement_table.measurement), 0) AS total

      FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)

GROUP BY 1
ORDER BY 1

This returns misleading 1's on the match:

WITH hour_series AS (
    select * from generate_series (0,12) AS hour
    ) 

    SELECT hour_series.hour,
           count(*) AS frequency,
           COALESCE(sum(measurement_table.measurement), 0) AS total

      FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)

GROUP BY 1
ORDER BY 1
0   1   1
1   2   2
2   1   0
3   1   0
4   1   0
5   1   0
6   1   0
7   1   0
8   1   0
9   1   0
10  3   10
11  1   0
12  1   0

The only difference between these two examples is the count term:

count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined to table field) -- 0 on no match, correct count otherwise.

That seems to be it, you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1 since the series data is matching once. Is this a nuance of joinining, or a nuance of count in Postgres?

Does this impact any other aggrgate? It seems like it sholdn't.

P.S. generate_series is just about the best thing ever.

GMB
  • 216,147
  • 25
  • 84
  • 135
Morris de Oryx
  • 1,857
  • 10
  • 28
  • Just a comment - don’t use * – RamblinRose May 18 '20 at 23:51
  • Can you expand on this point? I use count(*) everywhere by default. Clearly, that's lead me astray in this situation....but, otherwise, it's easy to read. – Morris de Oryx May 18 '20 at 23:58
  • How is this not answered by reading the DBMS manual re count? Or googling SO? This is a faq. Before considering posting please read your textbook and/or manual & google any error message or many clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags; read many answers. If you post a question, use one phrasing as title. Reflect your research. See [ask] & the voting arrow mouseover texts. – philipxy May 19 '20 at 00:07
  • Does this answer your question? [In SQL, what's the difference between count(column) and count(\*)?](https://stackoverflow.com/questions/59294/in-sql-whats-the-difference-between-countcolumn-and-count) – philipxy May 19 '20 at 00:10
  • philipxy, thanks for the thread reference. – Morris de Oryx May 19 '20 at 00:55
  • You can click on something associated with my comment/link to mark this as a duplicate question. – philipxy May 20 '20 at 04:43

1 Answers1

2

You figured out the problem correctly: count() behaves differently depending on the argument is is given.

count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).

On the other hand, when given a column name or expression as argument, count() takes in account any non-null value, and ignores null values. For your query, this lets you distinguish groups that have no match in the left joined table from groups where there are matches.

Note that this behavior is not Postgres specific, but belongs to the standard ANSI SQL specification (all databases that I know conform to it).

Bottom line:

  • in general cases, uses count(*); this is more efficient, since the database does not need to check for nulls (and makes it clear to the reader of the query that you just want to know how many rows belong to the group)

  • in specific cases such as yours, put the relevant expression in the count()

GMB
  • 216,147
  • 25
  • 84
  • 135