202

Let's say I have the following query.

SELECT ID, Email, ProductName, ProductModel FROM Products

How can I modify it so that it returns no duplicate Emails?

In other words, when several rows contain the same email, I want the results to include only one of those rows (preferably the last one). Duplicates in other columns should be allowed.

Clauses like DISTINCT and GROUP BY appear to work on entire rows. So I'm not sure how to approach this.

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Jonathan Wood
  • 65,341
  • 71
  • 269
  • 466
  • 3
    Ok, you need use PARTITION or use two select statements? – Phil C Feb 16 '11 at 20:25
  • And what should be shown if there are say 2 rows with same Email but different ProductName? The *(preferably the last one)* is not clear. Last by which ordering? – ypercubeᵀᴹ Feb 16 '11 at 20:28
  • @ypercube As stated in the question, preferably the last one. However, that's not really critical to me. I just want one of them. – Jonathan Wood Feb 16 '11 at 20:29
  • 1
    You can look at the following questions: [question1](http://stackoverflow.com/questions/932986/select-many-fields-applying-distinct-to-only-one-particular-field), [question2](http://stackoverflow.com/questions/966176/tsql-select-distinct-on-one-column) or [question3](http://stackoverflow.com/questions/1785634/select-distinct-on-one-column-return-multiple-other-columns-sql-server). – Marian Feb 16 '11 at 20:30
  • Why can't you use: SELECT DISTINCT Email, ID, ProductName, ProductModel FROM Products? – Rick Henderson Sep 11 '17 at 13:00

8 Answers8

233

If you are using SQL Server 2005 or above use this:

SELECT *
  FROM (
                SELECT  ID, 
                        Email, 
                        ProductName, 
                        ProductModel,
                        ROW_NUMBER() OVER(PARTITION BY Email ORDER BY ID DESC) rn
                    FROM Products
              ) a
WHERE rn = 1

EDIT: Example using a where clause:

SELECT *
  FROM (
                SELECT  ID, 
                        Email, 
                        ProductName, 
                        ProductModel,
                        ROW_NUMBER() OVER(PARTITION BY Email ORDER BY ID DESC) rn
                    FROM Products
                   WHERE ProductModel = 2
                     AND ProductName LIKE 'CYBER%'

              ) a
WHERE rn = 1
Chandu
  • 81,493
  • 19
  • 133
  • 134
  • 4
    I must investigate this PARTITION clause, never seen it in action before. Thanks for the example – LorenVS Feb 16 '11 at 20:41
  • @Cybernate One complication: My inner `SELECT` needs a `WHERE` condition. I'm thinking the row numbers will be assigned to all rows in the table. This syntax is just a little beyond me. Any chance of an update that would guarantee one row with a particular email that meets the `WHERE` condition? – Jonathan Wood Feb 16 '11 at 21:12
  • 1
    You can add where clause to the inner sql. I will update the post once I can access my laptop – Chandu Feb 16 '11 at 21:24
  • 1
    Updated the post with a sample using where clause. – Chandu Feb 16 '11 at 22:05
  • @Cybernate Do'h! So no other change needed. The expression is applied only to rows matching the `WHERE` condition. Cool. Thanks again. – Jonathan Wood Feb 16 '11 at 23:23
  • @Chandu How can I write the same query in Oracle SQL? – Fortune Apr 19 '17 at 10:07
  • Doesn't this return the first row with the email and not the last? I should definitely look into Partition... – Rick Henderson Sep 11 '17 at 13:02
  • i'm getting only 3 rows for the required distinct field whose rn = 1, how can I get all rows having rn=1,2,3 and so on? – Karan Tikku Jun 20 '18 at 13:39
  • @KaranTikku you can use ROW_NUMBER like any other property, so in your case you should alter the last WHERE clause to WHERE rn <=3. Note, however, that this will not be distinct any more. – DevilSuichiro Oct 16 '18 at 08:30
  • Cool solution, if, _(for visitors)_ ,`"rn = 1"` _doesn't_ make sense, you got to [check _this_](https://www.c-sharpcorner.com/blogs/rownumber-function-with-partition-by-clause-in-sql-server1) – Irf Feb 07 '19 at 12:48
  • 1
    I get this working correctly only when having _no_ `JOIN`s in my query. As soon as I have a `JOIN`, the `ROW_NUMBER` returns much higher values than "1". – Uwe Keim Feb 23 '20 at 13:14
  • I have the same problem as Uwe Keim. The extra row added in select breaks the field count and causes errors in some "joins" or "insert into" queries. – Dan A.S. Jul 16 '20 at 12:42
  • @Chandu how does the performance of this compare to a distinct on all columns? I would think it's faster since it does not have to consider other columns, however it is using a windows function... – Paradox May 11 '23 at 13:11
12

This assumes SQL Server 2005+ and your definition of "last" is the max PK for a given email

WITH CTE AS
(
SELECT ID, 
       Email, 
       ProductName, 
       ProductModel, 
       ROW_NUMBER() OVER (PARTITION BY Email ORDER BY ID DESC) AS RowNumber 
FROM   Products
)
SELECT ID, 
       Email, 
       ProductName, 
       ProductModel
FROM CTE 
WHERE RowNumber = 1
Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Pero P.
  • 25,813
  • 9
  • 61
  • 85
8

When you use DISTINCT think of it as a distinct row, not column. It will return only rows where the columns do not match exactly the same.

SELECT DISTINCT ID, Email, ProductName, ProductModel
FROM Products

----------------------
1 | something@something.com | ProductName1 | ProductModel1
2 | something@something.com | ProductName1 | ProductModel1

The query would return both rows because the ID column is different. I'm assuming that the ID column is an IDENTITY column that is incrementing, if you want to return the last then I recommend something like this:

SELECT DISTINCT TOP 1 ID, Email, ProductName, ProductModel
FROM Products
ORDER BY ID DESC

The TOP 1 will return only the first record, by ordering it by the ID descending it will return the results with the last row first. This will give you the last record.

jon3laze
  • 3,188
  • 6
  • 36
  • 69
  • 4
    As stated in the question, I see that DISTINCT works on the entire row. I want to do like you suggest above, but for every time the email is duplicated in the results (not just once). – Jonathan Wood Feb 16 '11 at 20:37
  • In that case I would recommend going with @Cybernate answer. That should do exactly what you need. – jon3laze Feb 16 '11 at 20:44
6

You can over that by using GROUP BY like this:

SELECT ID, Email, ProductName, ProductModel
FROM Products
GROUP BY Email
robe007
  • 3,523
  • 4
  • 33
  • 59
Marshall Unduemi
  • 207
  • 2
  • 13
  • 28
    Column 'Products.ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. – palota Aug 24 '16 at 10:32
  • 3
    This doesn't work without using something like MAX(ID),MAX(ProductName),MAX(ProductModel) for the other columns – avl_sweden Jan 09 '18 at 13:37
  • 4
    In postgres, you only need the aggregate function on the column that will be used in the group by clause, e.g. `SELECT id, max(email) AS email FROM tbl GROUP by email`. In SQL server ALL columns in the `SELECT` clause must be in an aggregate function. This bites me every time I go back. – Bruce Pierson Aug 01 '18 at 15:53
1

For Access, you can use the SQL Select query I present here:

For example you have this table:

CLIENTE|| NOMBRES || MAIL

888 || T800 ARNOLD || t800.arnold@cyberdyne.com

123 || JOHN CONNOR || s.connor@skynet.com

125 || SARAH CONNOR ||s.connor@skynet.com

And you need to select only distinct mails. You can do it with this:

SQL SELECT:

SELECT MAX(p.CLIENTE) AS ID_CLIENTE
, (SELECT TOP 1 x.NOMBRES 
    FROM Rep_Pre_Ene_MUESTRA AS x 
    WHERE x.MAIL=p.MAIL 
     AND x.CLIENTE=(SELECT MAX(l.CLIENTE) FROM Rep_Pre_Ene_MUESTRA AS l WHERE x.MAIL=l.MAIL)) AS NOMBRE, 
p.MAIL
FROM Rep_Pre_Ene_MUESTRA AS p
GROUP BY p.MAIL;

You can use this to select the maximum ID, the correspondent name to that maximum ID , you can add any other attribute that way. Then at the end you put the distinct column to filter and you only group it with that last distinct column.

This will bring you the maximum ID with the correspondent data, you can use min or any other functions and you replicate that function to the sub-queries.

This select will return:

CLIENTE|| NOMBRES || MAIL

888 || T800 ARNOLD || t800.arnold@cyberdyne.com

125 || SARAH CONNOR ||s.connor@skynet.com

Remember to index the columns you select and the distinct column must have not numeric data all in upper case or in lower case, or else it won't work. This will work with only one registered mail as well. Happy coding!!!

jRam90
  • 183
  • 1
  • 2
  • 12
0

Try This

;With Tab AS (SELECT DISTINCT Email FROM  Products)
SELECT Email,ROW_NUMBER() OVER(ORDER BY Email ASC) AS  Id FROM Tab
ORDER BY Email ASC
Abdullah Yousuf
  • 403
  • 5
  • 7
-1

The reason DISTINCT and GROUP BY work on entire rows is that your query returns entire rows.

To help you understand: Try to write out by hand what the query should return and you will see that it is ambiguous what to put in the non-duplicated columns.

If you literally don't care what is in the other columns, don't return them. Returning a random row for each e-mail address seems a little useless to me.

Ram
  • 3,092
  • 10
  • 40
  • 56
JohnFx
  • 34,542
  • 18
  • 104
  • 162
  • @JohnFix I want to return whole rows. I just don't want rows to be returned when the results already include a row with the same value in the Email column. – Jonathan Wood Feb 16 '11 at 20:33
  • So how should it decide which one to return? Do you really want a query that returns an arbitrary row for each e-mail. This really smells like you might need to re-think the problem you are trying to solve. Almost every time I've been asked this question (and it comes up a lot) it turns out the developer hasn't thought through the consequences in the app for this behavior. – JohnFx Feb 16 '11 at 20:34
  • 9
    I'm really having trouble following your logic. As stated in the question, I would prefer the last one (sorted by ID). Yes, if it selected a random row that would be okay. And, yes, I've thought about it. – Jonathan Wood Feb 16 '11 at 20:39
-1

Try this:

SELECT ID, Email, ProductName, ProductModel
FROM Products
WHERE ID IN (SELECT MAX(ID) FROM Products GROUP BY Email)
Community
  • 1
  • 1
IML
  • 181
  • 2
  • 7
  • 2
    Why should we try this? Why is this better than the other answers posted here in the past 8 years? If you want to share a better way to solve the problem, you need to explain why you recommend it. – Dharman Aug 16 '19 at 21:07
  • You can use any answer you want, i share solution that works for me – IML Mar 10 '23 at 14:07