Find most occurring Col3 for every pair of Col1 and Col2

Question

Given a table myTable with 4 columns, say Col1, Col2, Col3 and Col4:

A X 5 B
A Y 5 C
A X 7 D
A Y 3 E 
A X 7 F

I need to find most occurring col3 for each pair (col1, col2).

So result for this example will be:

A X 7   D/F  -- D or F
A Y 5/3 C/E  -- It can be 5 and C or 3 and E

So I wrote a query something like this :

select Col1,Col2,Col3 
from myTable M 
group by Col1,Col2,Col3 
having Col3 =
     (select Col3 
      from myTable N 
      where M.Col1=N.col1 
      group by Col3 
      order by Col3 desc limit 1);

But the query does not give the desired result.
Also I have no idea how to get Col4 as am making group by clause, and I don't want to make group by according to Col4.

For each (Col1, Col2) pair I want single Col4 that goes along with the maximum occurring Col3.

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

You only need a single subquery with DISTINCT ON:

SELECT DISTINCT ON (col1, col2)
       col1, col2, col3, min(col4) As col4
FROM   tbl
GROUP  BY col1, col2, col3
ORDER  BY col1, col2, count(*) DESC, col3;

This way you get a single row per (col1, col2) with the most common col3 (the samllest value if multiple tie for "most common") an the smallest col4 to go along with that col3.

Similarly, to get all qualifying col3 you can use the window function rank() in a subquery, which is also executed after the aggregation:

SELECT col1, col2, col3, col4_list
FROM  (
   SELECT col1, col2, col3, count(*) AS ct, string_agg(col4, '/') AS col4_list
        , rank() OVER (PARTITION BY col1, col2 ORDER BY count(*) DESC) AS rnk
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rnk = 1
ORDER  BY col1, col2, col3;

This works, because you can run window functions over aggregate functions.
Cast to text if the data type is not a character type.

Or, all qualifying col3 per (col1, col2) in a list, plus all matching col4 in a second list:

SELECT col1, col2
     , string_agg(col3::text, '/') AS col3_list  -- cast if necessary
     , string_agg(col4_list,  '/') AS col4_list
FROM  (
   SELECT col1, col2, col3, count(*) AS ct, string_agg(col4, '/') AS col4_list
        , rank() OVER (PARTITION BY col1, col2 ORDER BY count(*) DESC) AS rnk
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rnk = 1
GROUP  BY col1, col2
ORDER  BY col1, col2, col3_list;

Solution for Amazon Redshift

row_number() is available, so this should work:

SELECT col1, col2, col3, col4
FROM  (
   SELECT col1, col2, col3, min(col4) AS col4
        , row_number() OVER (PARTITION BY col1, col2
                             ORDER BY count(*) DESC, col3) AS rn
   FROM   tbl
   GROUP  BY col1, col2, col3
   ) sub
WHERE  rn = 1
ORDER  BY col1, col2;

Or if window functions over aggregate functions are not allowed, use another subquery

SELECT col1, col2, col3, col4
FROM  (
   SELECT *, row_number() OVER (PARTITION BY col1, col2
                                ORDER BY ct DESC, col3) AS rn
   FROM (
      SELECT col1, col2, col3, min(col4) AS col4, COUNT(*) AS ct
      FROM   tbl
      GROUP  BY col1, col2, col3
      ) sub1
   ) sub2
WHERE  rn = 1;

This picks the smallest col3 if more than one tie for the maximum count. And the smallest col4 for the respective col3.

SQL Fiddle demonstrating all in Postgres 9.3.

I want any one of the two, in Col4 and not both. I dont want any array list. Just see, I had put OR and not AND — GitCoder, Sep 23 '15 at 10:08
If you really don't care which col4, `min(col4)` or `max(col4)` are easy substitutes for the string_agg. — , Sep 23 '15 at 12:50
@GitCoder: Or you can combine the aggregate with `DISTINCT ON`. — Erwin Brandstetter, Sep 24 '15 at 13:01
@ErwinBrandstetter Check this http://sqlfiddle.com/#!15/f4b19/1 . I want that instead of two rows for (A,Y) pair, only single row should be there — GitCoder, Sep 25 '15 at 22:22
@WumpusQ.Wumbley Check this http://sqlfiddle.com/#!15/f4b19/1 . I want that instead of two rows for (A,Y) pair, only single row should be there — GitCoder, Sep 25 '15 at 22:22
@GitCoder: You may have missed my update from yesterday. The solution is already there (first query). I updated the fiddle now, too. — Erwin Brandstetter, Sep 25 '15 at 22:45
@ErwinBrandstetter DISTINCT ON is not supported with my PSQL. So whats way out ? — GitCoder, Sep 26 '15 at 00:34
@GitCoder: What is "your PSQL" exactly? (This kind of information should be among the first things in your question.) — Erwin Brandstetter, Sep 26 '15 at 00:39
@GitCoder: Next time please declare your actual RDBMS up front. It's a frustrating process to waste time until you come forward. — Erwin Brandstetter, Sep 26 '15 at 01:20

Mureinik · Answer 2 · 2015-09-22T22:43:40.530

0

One way of doing this is with the row_number() window function on top of the aggregate query:

SELECT col1, col2, col3
FROM   (SELECT col1, col2, col3, 
        ROW_NUMBER () OVER (PARTITION BY col1, col2 ORDER BY cnt DESC) AS rn
        FROM (SELECT   col1, col2, col3, COUNT(*) AS cnt
              FROM     mytable
              GROUP BY col1, col2, col3) t
       ) q
WHERE  rn = 1

edited Sep 22 '15 at 22:43

answered Sep 22 '15 at 22:31

Mureinik

297,002
52
306
350

ERROR: syntax error at or near "PARTITION" LINE 3: row_number (PARTITION BY comp_sku, zone_id ORDER BY ...What can be the reason ? – GitCoder Sep 22 '15 at 22:41
@GitCoder the reason is I wasn't paying attention and completely forgot the `over` clause. Edited and fixed, try now. – Mureinik Sep 22 '15 at 22:44
Suppose if we have one more column Col4, and want in result, then how to get this also. Because we dont want it in group by clause – GitCoder Sep 22 '15 at 23:36
Like for any Col1,Col2 pair any Col4 matching the criteria will work out if there are multiple options. – GitCoder Sep 23 '15 at 00:27

Find most occurring Col3 for every pair of Col1 and Col2

2 Answers2

Solution for Amazon Redshift