-1

I have a table with values:

CREATE TABLE IF NOT EXISTS `games` (
  `date` DATE,
  `item_id` varchar(40),
  `player_id` varchar(40)
);
INSERT INTO `games` (`date`, `item_id`, `player_id`) VALUES
  ('2023-01-03', 'raven', 'david'),
  ('2023-01-04', 'folly', 'david'),
  ('2023-01-05', 'syd', 'david'),
  ('2023-01-05', 'syd', 'ire'),
  ('2023-01-01', 'raven', 'jane'),
  ('2023-01-03', 'syd', 'jane'),
  ('2023-01-03', 'folly', 'harry'),
  ('2023-01-10', 'syd', 'harry'),
  ('2023-01-10', 'syd', 'yvette')
;

However, these two queries return different results:

SELECT
  g.date,
  g.item_id,
  COUNT(*) AS item_games,
  SUM(1) OVER (PARTITION BY g.date) AS total_matches
FROM games g
GROUP BY g.date, g.item_id
ORDER BY g.date, g.item_id;

returns:

2023-01-01  raven   1   1
2023-01-03  folly   1   3
2023-01-03  raven   1   3
2023-01-03  syd     1   3
2023-01-04  folly   1   1
2023-01-05  syd     2   1
2023-01-10  syd    2    1

while

SELECT
  g.date,
  g.item_id,
  COUNT(*) AS item_games,
  SUM(count(*)) OVER (PARTITION BY g.date) AS item_game_ratio
FROM games g
GROUP BY g.date, g.item_id
ORDER BY g.date, g.item_id;

returns:

2023-01-01  raven   1   1
2023-01-03  folly   1   3
2023-01-03  raven   1   3
2023-01-03  syd     1   3
2023-01-04  folly   1   1
2023-01-05  syd     2   2
2023-01-10  syd     2   2

I don't understand why they only return different values for the last two rows.

Thom A
  • 88,727
  • 11
  • 45
  • 75
cozyss
  • 1,290
  • 1
  • 15
  • 22
  • If it's just "SUM(1) AS total_matches" without the window function than the results are the same – cozyss May 01 '23 at 04:27

3 Answers3

4

The window function is executed after the GROUP BY clause. The logical order of execution for an SQL query is as follows:

FROM and JOINs
WHERE
GROUP BY
HAVING
WINDOW functions
SELECT
DISTINCT
ORDER BY
LIMIT / OFFSET

In your query, the GROUP BY clause is applied first, followed by the window function

  1. (SUM(1) OVER (PARTITION BY g.date)) here you are counting the rows of the group by result for each date.
  2. However count(*) is performed as part of the GROUP BY, so for SUM(count(*)) OVER (PARTITION BY g.date) the data that is being summed will be the number of source rows for each date.
Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
1

It looks to me like the first query is telling you how many items are used on that date. The second query is telling you how many players are using the item on that date.

They return different values for the last 2 routes because syd has 2 games on each of those days.

1

SUM(1) is something you'll rarely see, because adding a 1 for each row is just an obfuscated way of counting the rows.

SUM(1) OVER (PARTITION BY g.date)

or

COUNT(*) OVER (PARTITION BY g.date)

count how many rows exist for a date.

SUM(COUNT(*)) OVER (PARTITION BY g.date)

on the other hand adds up all the single counts for each date. Date 2023-01-05 (and 2023-01-10 for that matter) has one result row with COUNT(*) = 2. Building the sum still gives you 2, because with one row and one value, there is nothing to add up.

As your results show one row per date and item, SUM(1) OVER (PARTITION BY g.date) gets you the number of different items on a date, while SUM(COUNT(*)) OVER (PARTITION BY g.date) gives you the number of games on a date.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73