2

I have a table like the following (example here):

CREATE TABLE topics (
  name varchar(64),
  url varchar(253),
  statistic integer,
  pubdate timestamp
);

INSERT INTO topics VALUES
('a',  'b',  100,  TIMESTAMP '2011-05-16 15:36:38'),  
('a',  'c',  110,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'd',  120,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'e',  90,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'f',  80,   TIMESTAMP '2014-04-01 00:00:00'), 
('a',  'g',  70,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'h',  150,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'i',  50,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'j',  10,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'k',  11,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'l',  12,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'm',  9,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'n',  8,    TIMESTAMP '2014-04-01 00:00:00'),
('b',  'o',  7,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'p',  15,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'q',  5,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'r',  2,    TIMESTAMP '2014-04-01 00:00:00')

I would like to take the top two rows based on their statistic value _from each (name, date(pubdate)) combination.

In other words, I would like to GROUP BY name, date(pubdate), but without an aggregate function, and instead with the result simply taking the top two rows based on their statistic from each group. (So, I know it is not really a GROUP BY, but rather a greatest-n-per-group.)

I'm using Google Big Query with Standard SQL. I've looked at a number of other solutions but am unsure how to achieve the result in this case.

Desired result:

name    url     statistic   date

a       b       100         2011-05-16
a       e       90          2011-05-16

a       h       150         2014-04-01
a       d       120         2014-04-01

b       m       9           2011-05-16
b       k       11          2011-05-16

b       l       12          2014-04-01
b       p       15          2014-04-01
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
  • Do you have a column with a primary key also as that is needed to get a pure deterministic (fixed) result always as the results can still be non deterministic (random) without even when using `ORDER BY statistic` as the statistic column values are not unique – Raymond Nijland Jun 14 '19 at 17:12
  • 1
    You mean that in the case of "ties" for `statistic` the result would not be guaranteed to be deterministic? @RaymondNijland – Brad Solomon Jun 14 '19 at 17:12
  • 1
    *"You mean that in the case of "ties" for statistic the result would not be guaranteed to be deterministic? "* yes indeed that is what i mean @BradSolomon ideally you should use `ORDER BY some_column, ` – Raymond Nijland Jun 14 '19 at 17:14

3 Answers3

5

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`   
  GROUP BY name, day
), UNNEST(arr)
-- ORDER BY name, day  

You can test, play with above using sample data in your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' name, 'b' url,  100 statistic,  TIMESTAMP '2011-05-16 15:36:38' pubdate UNION ALL  
  SELECT 'a', 'c',  110,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'd',  120,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'e',  90,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'f',  80,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'a', 'g',  70,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'h',  150,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'i',  50,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'j',  10,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'k',  11,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'l',  12,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'm',  9,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'n',  8,    '2014-04-01 00:00:00' UNION ALL
  SELECT 'b', 'o',  7,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'p',  15,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'q',  5,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'r',  2,    '2014-04-01 00:00:00' 
)
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`  
  GROUP BY name, day
), UNNEST(arr)
ORDER BY name, day   

with result

Row name    day         url statistic    
1   a       2011-05-16  b   100  
2   a       2011-05-16  e   90   
3   a       2014-04-01  h   150  
4   a       2014-04-01  d   120  
5   b       2011-05-16  k   11   
6   b       2011-05-16  m   9    
7   b       2014-04-01  p   15   
8   b       2014-04-01  l   12   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
4

Use the ARRAY_AGG function:

SELECT
  name,
  DATE(pubdate) AS pubdate,
  ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
FROM dataset.table
GROUP BY name, pubdate

You can use a subquery with UNNEST to get rows as output without arrays:

SELECT name, pubdate, url, statistic
FROM (
  SELECT
    name,
    DATE(pubdate) AS pubdate,
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
  FROM dataset.table
  GROUP BY name, pubdate
), UNNEST(top_urls)
Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99
0
    with xx as(
      select name, url, statistic, pubdate, row_number() over(partition by name , url order by statistic desc) rn 
      from topics)
select * except(rn) 
from xx 
where rn <= 2;
Aleksandr
  • 1,814
  • 11
  • 19