How to select top 3 values from each group in a table with SQL which have duplicates

Question

Assume we have a table which has two columns, one column contains the names of some people and the other column contains some values related to each person. One person can have more than one value. Each value has a numeric type. The question is we want to select the top 3 values for each person from the table. If one person has less than 3 values, we select all the values for that person.

The issue can be solved if there are no duplicates in the table by the query provided in this article Select top 3 values from each group in a table with SQL . But if there are duplicates, what is the solution?

For example, if for one name John, he has 5 values related to him. They are 20,7,7,7,4. I need to return the name/value pairs as below order by value descending for each name:

-----------+-------+
| name     | value |
-----------+-------+
| John     |    20 |
| John     |     7 |
| John     |     7 |
-----------+-------+

Only 3 rows should be returned for John even though there are three 7s for John.

What database are you using? You tagged this with mysql, sql-server, oracle and hadoop. The syntax could be different in each database, providing more specific details would be helpful. — Taryn, May 23 '13 at 17:44
`Select * from LackOfDetails Where DataBaseSoftware is NULL` — Zane, May 23 '13 at 17:46
Which database(s) are you really interested in doing this with? Oracle and SQL Server support analytical queries which make this pretty simple (once you get used to the new syntax). MySQL can do this too but the answer is radically different. — Ed Gibbs, May 23 '13 at 17:46
@bluefeet Actually this is a general question. What I want to do is just using the general SQL syntax, no need specified SQL functions in different databases. — PixelsTech, May 23 '13 at 17:49
@EdGibbs This is applicable to all the databases. Initially I just want to use general SQL syntax to achieve this. No need to use specified functions in different databases. — PixelsTech, May 23 '13 at 17:51
@sonic0002 Then don't tag it with any databases and be specific that you want a generic sql solution not database specific. — Taryn, May 23 '13 at 17:52
Top 3 numeric value by value, or by occurrence? IE 4 appears 3 times for John, so it's the top rank? Either way, you'll want to assign a rank to each value, then select from that group the distinct names and values where the rank is less than or equal to 3. — Hart CO, May 23 '13 at 17:54
Does the table have any other columns? What is the primary key? — ypercubeᵀᴹ, May 23 '13 at 18:19
@ypercube It can have other columns. But to simplify it, we just want the table having two columns and there is no primary key. — PixelsTech, May 23 '13 at 18:23
This is not good. All tables should have a primary key in DBMS and no duplicate rows. Otherwise it's not a simplification but unneeded complications. — ypercubeᵀᴹ, May 23 '13 at 18:27
@ypercube As I said, it can have more columns and primary keys. But I don't care about other columns, I only care about these two columns. Is this assumption reasonable? — PixelsTech, May 23 '13 at 18:30

ypercubeᵀᴹ · Answer 1 · 2013-05-23T19:13:19.897

In many modern DBMS (e.g. Postgres, Oracle, SQL-Server, DB2 and many others), the following will work just fine. It uses CTEs and ranking function ROW_NUMBER() which is part of the latest SQL standard:

 WITH cte AS
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

Without CTE, only ROW_NUMBER():

SELECT name, value, rn
FROM 
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ;

Tested in:

Postgres
Oracle
SQL-Server

In MySQL and other DBMS that do not have ranking functions, one has to use either derived tables, correlated subqueries or self-joins with GROUP BY.

The (tid) is assumed to be the primary key of the table:

SELECT t.tid, t.name, t.value,              -- self join and GROUP BY
       COUNT(*) AS rn
FROM t
  JOIN t AS t2
    ON  t2.name = t.name
    AND ( t2.value > t.value
        OR  t2.value = t.value
        AND t2.tid <= t.tid
        )
GROUP BY t.tid, t.name, t.value
HAVING COUNT(*) <= 3
ORDER BY name, rn ;


SELECT t.tid, t.name, t.value, rn
FROM
  ( SELECT t.tid, t.name, t.value,
           ( SELECT COUNT(*)                -- inline, correlated subquery
             FROM t AS t2
             WHERE t2.name = t.name
              AND ( t2.value > t.value
                 OR  t2.value = t.value
                 AND t2.tid <= t.tid
                  )
           ) AS rn
    FROM t
  ) AS t
WHERE rn <= 3
ORDER BY name, rn ;

Tested in MySQL

@jwg In recent versions of Postgres, you can even make it faster (if the number of groups (`name` here) is small) by using a `LATERAL` join. — ypercubeᵀᴹ, Nov 02 '15 at 11:41

Gordon Linoff · Answer 2 · 2013-05-23T18:28:52.117

0

I was going to downvote the question. However, I realized that it might really be asking for a cross-database solution.

Assuming you are looking for a database independent way to do this, the only way I can think of uses correlated subqueries (or non-equijoins). Here is an example:

select distinct t.personid, val, rank
from (select t.*,
             (select COUNT(distinct val) from t t2 where t2.personid = t.personid and t2.val >= t.val
             ) as rank
      from t
     ) t
where rank in (1, 2, 3)

However, each database that you mention (and I note, Hadoop is not a database) has a better way of doing this. Unfortunately, none of them are standard SQL.

Here is an example of it working in SQL Server:

with t as (
      select 1 as personid, 5 as val union all
      select 1 as personid, 6 as val union all
      select 1 as personid, 6 as val union all
      select 1 as personid, 7 as val union all
      select 1 as personid, 8 as val
     )
select distinct t.personid, val, rank
from (select t.*,
             (select COUNT(distinct val) from t t2 where t2.personid = t.personid and t2.val >= t.val
             ) as rank
      from t
     ) t
where rank in (1, 2, 3);

edited May 23 '13 at 18:28

answered May 23 '13 at 17:50

Gordon Linoff

1,242,037
58
646
786

Is this working if there are duplicate entries? I tried it, it seems it's not working – PixelsTech May 23 '13 at 17:58
If you tried it, what DB did you try it on? Not sure why we aren't allowed to know what the target DB is. – Bill Gregg May 23 '13 at 18:10
@wagregg I tried it on MySQL – PixelsTech May 23 '13 at 18:15
@sonic0002 . . . It's fixed. I left out the `where` clause in the original query -- I was more focused on the harder problem of calculating the rank. – Gordon Linoff May 23 '13 at 18:37
@GordonLinoff I am sorry. The above query returns the result set of 20,7,4 for John instead of 20,7,7. This is not expected. – PixelsTech May 23 '13 at 19:02

score 0 · Answer 3 · answered May 23 '13 at 18:42

0

Using GROUP_CONCAT and FIND_IN_SET you can do that.Check SQLFIDDLE.

SELECT *
FROM tbl t
WHERE FIND_IN_SET(t.value,(SELECT
                             SUBSTRING_INDEX(GROUP_CONCAT(t1.value ORDER BY VALUE DESC),',',3)
                           FROM tbl t1
                           WHERE t1.name = t.name
                           GROUP BY t1.name)) > 0
ORDER BY t.name,t.value desc

answered May 23 '13 at 18:42

Deval Shah

1,094
8
22

I am sorry. This one also works not as expected. If we have three 4s for A in your example on SQL FIDDLE, the result set will be 20,4,4,4 for A instead of 20,4,4. – PixelsTech May 23 '13 at 19:10

score 0 · Answer 4 · answered May 23 '13 at 18:57

If your result set is not so heavy, you can write a stored procedure (or an anonymous PL/SQL-block) for that problem which iterates the result set and finds the bigges three by a simple comparing algorithm.

score 0 · Answer 5 · answered May 24 '13 at 01:34

Try this -

CREATE TABLE #list ([name] [varchar](100) NOT NULL, [value] [int] NOT NULL)
INSERT INTO #list VALUES ('John', 20), ('John', 7), ('John', 7), ('John', 7), ('John', 4);

WITH cte
AS (
SELECT NAME
    ,value
    ,ROW_NUMBER() OVER (
        PARTITION BY NAME ORDER BY (value) DESC
        ) RN
FROM #list
)
SELECT NAME
,value
FROM cte
WHERE RN < 4
ORDER BY value DESC

score 0 · Answer 6 · answered May 24 '13 at 02:05

This works for MS SQL. Should be workable in any other SQL dialect that has the ability to assign row numbers in a group by or over clause (or equivelant)

if object_id('tempdb..#Data') is not null drop table #Data;
GO

create table #data (name varchar(25), value integer);
GO
set nocount on;
insert into #data values ('John', 20);
insert into #data values ('John', 7);
insert into #data values ('John', 7);
insert into #data values ('John', 7);
insert into #data values ('John', 5);
insert into #data values ('Jack', 5);
insert into #data values ('Jane', 30);
insert into #data values ('Jane', 21);
insert into #data values ('John', 5);
insert into #data values ('John', -1);
insert into #data values ('John', -1);
insert into #data values ('Jane', 18);
set nocount off;
GO

with D as (
SELECT
     name
    ,Value
    ,row_number() over (partition by name order by value desc) rn
From
    #Data
)
SELECT Name, Value
FROM D
WHERE RN <= 3
order by Name, Value Desc

Name    Value
Jack    5
Jane    30
Jane    21
Jane    18
John    20
John    7
John    7

How to select top 3 values from each group in a table with SQL which have duplicates

6 Answers6

Linked