0

I've inherited some fun SQL and am trying to figure out how to how to eliminate rows with duplicate IDs. Our indexes are stored in a somewhat columnar format and then we pivot all the rows into one with the values as different columns.

The below sample returns three rows of unique data, but the IDs are duplicated. I need just two rows with unique IDs (and the other columns that go along with it). I know I'll be losing some data, but I just need one matching row per ID to the query (first, top, oldest, newest, whatever).

I've tried using DISTINCT, GROUP BY, and ROW_NUMBER, but I keep getting the syntax wrong, or using them in the wrong place.

I'm also open to rewriting the query completely in a way that is reusable as I currently have to generate this on the fly (cardtypes and cardindexes are user defined) and would love to be able to create a stored procedure. Thanks in advance!

declare @cardtypes table ([ID] int, [Name] nvarchar(50))
declare @cards table ([ID] int, [CardTypeID] int, [Name] nvarchar(50))
declare @cardindexes table ([ID] int, [CardID] int, [IndexType] int, [StringVal] nvarchar(255), [DateVal] datetime)

INSERT INTO @cardtypes VALUES (1, 'Funny Cards')
INSERT INTO @cardtypes VALUES (2, 'Sad Cards')

INSERT INTO @cards VALUES (1, 1, 'Bunnies')
INSERT INTO @cards VALUES (2, 1, 'Dogs')
INSERT INTO @cards VALUES (3, 1, 'Cat')
INSERT INTO @cards VALUES (4, 1, 'Cat2')

INSERT INTO @cardindexes VALUES (1, 1, 1, 'Bunnies', null)
INSERT INTO @cardindexes VALUES (2, 1, 1, 'playing', null)
INSERT INTO @cardindexes VALUES (3, 1, 2, null, '2014-09-21')
INSERT INTO @cardindexes VALUES (4, 2, 1, 'Dogs', null)
INSERT INTO @cardindexes VALUES (5, 2, 1, 'playing', null)
INSERT INTO @cardindexes VALUES (6, 2, 1, 'poker', null)
INSERT INTO @cardindexes VALUES (7, 2, 2, null, '2014-09-22')


SELECT TOP(100)
    [ID] = c.[ID],
    [Name] = c.[Name],
    [Keyword] = [colKeyword].[StringVal],
    [DateAdded] = [colDateAdded].[DateVal]
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]

Edit:

While both solutions are valid, I ended up using the MAX() solution from @popovitsj as it was easier to implement. The issue of data coming from multiple rows doesn't really factor in for me as all rows are essentially part of the same record. I will most likely use both solutions depending on my needs.

Here's my updated query (as it didn't quite match the answer):

SELECT TOP(100)
    [ID] = c.[ID],
    [Name] = MAX(c.[Name]),
    [Keyword] = MAX([colKeyword].[StringVal]),
    [DateAdded] = MAX([colDateAdded].[DateVal])
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
GROUP BY c.ID
ORDER BY [DateAdded]
Jim Billig
  • 344
  • 2
  • 14

2 Answers2

2

using row number windowed function along with a CTE will do this pretty well. For example:

;With preResult AS (
SELECT TOP(100)
    [ID] = c.[ID],
    [Name] = c.[Name],
    [Keyword] = [colKeyword].[StringVal],
    [DateAdded] = [colDateAdded].[DateVal],
    ROW_NUMBER()OVER(PARTITION BY c.ID ORDER BY [colDateAdded].[DateVal]) rn
FROM @cards AS c
LEFT JOIN @cardindexes AS [colKeyword] ON [colKeyword].[CardID] = c.ID AND [colKeyword].[IndexType] = 1
LEFT JOIN @cardindexes AS [colDateAdded] ON [colDateAdded].[CardID] = c.ID AND [colDateAdded].[IndexType] = 2
WHERE [colKeyword].[StringVal] LIKE 'p%' AND c.[CardTypeID] = 1
ORDER BY [DateAdded]
)

SELECT * from preResult WHERE rn = 1
Ryan B.
  • 3,575
  • 2
  • 20
  • 26
  • This appears to work. Can you tell me what the drawbacks of doing it this way are? – Jim Billig Nov 17 '14 at 21:34
  • 1
    This is a much better technique of picking one row out of many in the presence of duplicates than the MAX coupled with GROUP BY technique. The MAX coupled with GROUP BY technique can easily end up delivering a merge of two or more rows, and users aren't likely to want or expect that. If they want and expect a merge, which I doubt, then go the MAX / GROUP BY route. I use ROW_NUMBER() a lot instead. On DB2 we can also use a LATERAL join to accomplish the same thing, which is another excellent tool to pick one row out of many, but I think SQL Server calls it something different. – Mike Jones Nov 18 '14 at 02:00
2

You could use MAX or MIN to 'decide' on what to display for the other columns in the rows that are duplicate.

SELECT ID, MAX(Name), MAX(Keyword), MAX(DateAdded)
(...)
GROUP BY ID;
wvdz
  • 16,251
  • 4
  • 53
  • 90
  • I'd also add ORDER BY MAX(DateAdded) to keep the same order. Actually I think the question is duplicate to http://stackoverflow.com/questions/5391564/how-to-use-distinct-and-order-by-in-same-select-statement – sarh Nov 17 '14 at 21:02
  • This can mix the data though... the row that you get back with your id might not actually match any of the original rows. – Ryan B. Nov 17 '14 at 21:03
  • True, it depends on the exact requirements if this is a good solution. – wvdz Nov 17 '14 at 21:11
  • @popovitsj I've tried your way and it appears to work. Can you give me pros and cons to this vs. CTE/ROW_NUMBER() as suggested by Greenspark? Also, I don't mind the order as ultimately they are all part of the same record, I just can utilize the data in more than one row. – Jim Billig Nov 17 '14 at 23:18
  • @JimBillig The pros of this solution are better readibility and performance. The only con is as mentioned by Greenspark: it scrambles the data: it might take the Keyword from row 1 and the DateAdded from row 2, whichever happens to be the max value. – wvdz Nov 18 '14 at 13:10