SQL - remove duplicates from left join

Question

I'm creating a joined view of two tables, but am getting unwanted duplicates from table2.
For example: table1 has 9000 records and I need the resulting view to contain exactly the same; table2 may have multiple records with the same FKID but I only want to return one record (random chosen is ok with my customer). I have the following code that works correctly, but performance is slower than desired (over 14 seconds).

SELECT     
    OBJECTID
    , PKID
    ,(SELECT TOP (1) SUBDIVISIO
        FROM dbo.table2 AS t2
        WHERE (t1.PKID = t2.FKID)) AS ProjectName
    ,(SELECT TOP (1) ASBUILT1
        FROM dbo.table2 AS t2
        WHERE (t1.PKID = t2.FKID)) AS Asbuilt
FROM dbo.table1 AS t1

Is there a way to do something similar with joins to speed up performance?
I'm using SQL Server 2008 R2.
I got close with the following code (~.5 seconds), but 'Distinct' only filters out records when all columns are duplicate (rather than just the FKID).

SELECT
    t1.OBJECTID
    ,t1.PKID
    ,t2.ProjectName
    ,t2.Asbuilt
FROM dbo.table1 AS t1
    LEFT JOIN (SELECT
        DISTINCT FKID
        ,ProjectName
        ,Asbuilt
        FROM dbo.table2) t2
    ON t1.PKID = t2.FKID

table examples

table1          table2

OID, PKID       FKID, ProjectName, Asbuilt
1, id1          id1, P1, AB1
2, id2          id1, P5, AB5
3, id4          id2, P10, AB2
5, id5          id5, P4, AB4

In the above example returned records should be id5/P4/AB4, id2/P10/AB2, and (id1/P1/AB1 OR id1/P5/AB5)

My search came up with similar questions, but none that resolved my problem. link, link
Thanks in advance for your help. This is my first post so let me know if I've broken any rules.

Wouldn't adding DISTINCT to the first SELECT fix this? – Corey Jan 22 '22 at 03:05 — Corey, Jan 22 '22 at 03:05

score 14 · Accepted Answer · answered Jan 30 '13 at 22:41

14

This will give the results you requested and should have the best performance.

SELECT     
    OBJECTID
    , PKID
    , t2.SUBDIVISIO,
    , t2.ASBUILT1

FROM        dbo.table1 AS t1
OUTER APPLY (
    SELECT  TOP 1 *
    FROM    dbo.table2 AS t2
    WHERE   t1.PKID = t2.FKID
    ) AS t2

answered Jan 30 '13 at 22:41

RBarryYoung

55,398
14
96
137

this works also! I'll have to look further into the OUTER APPLY. – Rick Momsen Jan 30 '13 at 23:07
@RickMomsen Just out of curiosity, how fast is it now? – RBarryYoung Jan 30 '13 at 23:08
This is my solution since it consistently returns an entire record. – Rick Momsen Jan 30 '13 at 23:22
6891ms compared with 7581ms (my sub select query). I should note that I have to run this against a view rather than the source table. Your query is faster than the others (mentioned so far) when ran against either the view or the source table; except Gordon's option2 had little performance difference between table or view – Rick Momsen Jan 30 '13 at 23:46

score 3 · Answer 2 · answered Jan 30 '13 at 22:32

3

Your original query is producing arbitrary values for the two columns (the use of top with no order by). You can get the same effect with this:

SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
     (SELECT FKID, min(ProjectName) as ProjectName, MIN(asBuilt) as AsBuilt
      FROM dbo.table2
      group by fkid
     ) t2
    ON t1.PKID = t2.FKID

This version replaces the distinct with a group by.

To get a truly random row in SQL Server (which your syntax suggests you are using), try this:

SELECT t1.OBJECTID, t1.PKID, t2.ProjectName, t2.Asbuilt
FROM dbo.table1 t1 LEFT JOIN
     (SELECT FKID, ProjectName, AsBuilt,
             ROW_NUMBER() over (PARTITION by fkid order by newid()) as seqnum
      FROM dbo.table2
     ) t2
    ON t1.PKID = t2.FKID and t2.seqnum = 1

This assumes version 2005 or greater.

answered Jan 30 '13 at 22:32

Gordon Linoff

1,242,037
58
646
786

Thank you Gordon! I tried option1 but the group by kept failing b/c the other fields were not aggregated. I also tried option2 but couldn't get it to work. I'm going to go with option1 because of the consistent output. – Rick Momsen Jan 30 '13 at 23:02
Sorry, but I have to uncheck your answer. the "MIN" in the sub select actually mixes values between the records and I prefer to have a complete record returned. This way the client won't get confused on which record needs to be updated. I appreciate the education! – Rick Momsen Jan 30 '13 at 23:19
I thought it would be worthy to note that your option2 is far faster than other queries when ran against a view. (364ms vs 7581ms and 6891ms) – Rick Momsen Jan 30 '13 at 23:47
@RickMomsen . . . Your original query also mixed results, which is why I wrote it that way (and I tried to explain that in the first sentence). I'm a bit surprised that the second version runs faster, but the window functions are highly optimized. – Gordon Linoff Jan 31 '13 at 00:10
You're right Gordon. The performance concern stopped me before I investigated the results and knew that would be an issue. Thanks again! – Rick Momsen Feb 01 '13 at 18:11

score 1 · Answer 3 · answered Jan 30 '13 at 22:51

1

If you want described result, you need to use INNER JOIN and following query will satisfy your need:

SELECT
  t1.OID,
  t1.PKID,
  MAX(t2.ProjectName) AS ProjectName,
  MAX(t2.Asbuilt) AS Asbuilt
FROM table1 t1
JOIN table2 t2 ON t1.PKID = t2.FKID
GROUP BY
  t1.OID,
  t1.PKID

If you want to see all rows from left table (table1) whether it has pair in right table or not, then use LEFT JOIN and same query will gave you desired result.

EDITED

This construction has good performance, and you dont need to use subqueries.

answered Jan 30 '13 at 22:51

veljasije

6,722
12
48
79

1

This works too, but has the same MIN/MAX issues as Gordon's Option1 above. – Rick Momsen Jan 30 '13 at 23:21
I agree, but I am using simple JOIN instead derived table. – veljasije Jan 30 '13 at 23:33

SQL - remove duplicates from left join

3 Answers3