Left Join without duplicate rows from left table

Question

Please look at the following query:

tbl_Contents

Content_Id  Content_Title    Content_Text
10002   New case Study   New case Study
10003   New case Study   New case Study
10004   New case Study   New case Study
10005   New case Study   New case Study
10006   New case Study   New case Study
10007   New case Study   New case Study
10008   New case Study   New case Study
10009   New case Study   New case Study
10010   SEO News Title   SEO News Text
10011   SEO News Title   SEO News Text
10012   Publish Contents SEO News Text

tbl_Media

Media_Id    Media_Title  Content_Id
1000    New case Study   10012
1001    SEO News Title   10010
1002    SEO News Title   10011
1003    Publish Contents 10012

QUERY

SELECT 
C.Content_ID,
C.Content_Title,
M.Media_Id

FROM tbl_Contents C
LEFT JOIN tbl_Media M ON M.Content_Id = C.Content_Id 
ORDER BY C.Content_DatePublished ASC

RESULT

10002   New case Study  2014-03-31 13:39:29.280 NULL
10003   New case Study  2014-03-31 14:23:06.727 NULL
10004   New case Study  2014-03-31 14:25:53.143 NULL
10005   New case Study  2014-03-31 14:26:06.993 NULL
10006   New case Study  2014-03-31 14:30:18.153 NULL
10007   New case Study  2014-03-31 14:30:42.513 NULL
10008   New case Study  2014-03-31 14:31:56.830 NULL
10009   New case Study  2014-03-31 14:35:18.040 NULL
10010   SEO News Title  2014-03-31 15:22:15.983 1001
10011   SEO News Title  2014-03-31 15:22:30.333 1002
10012   Publish         2014-03-31 15:25:11.753 1000
10012   Publish         2014-03-31 15:25:11.753 1003

10012 are coming twice...!

My query is returning duplicate rows from tbl_Contents (left table in the join)

Some rows in tbl_Contents has more than 1 associated rows in tbl_Media. I need all rows from tbl_Contents even if there are Null values exists in the tbl_Media BUT NO DUPLICATE RECORDS.

This is behaving as designed -- the rows aren't duplicated, they each have a different `media_id`. In your example, which row would you keep? — sgeddes, Mar 31 '14 at 18:45
I agree there is no problem here. Those are two different rows. — Zane, Mar 31 '14 at 18:48
Please check updated / corrected tbl_Media - now has Content_Id column, thanks — urooj.org, Mar 31 '14 at 18:56

eouw0o83hf · Accepted Answer · 2019-10-25T17:12:34.513

105

Try an OUTER APPLY

SELECT 
    C.Content_ID,
    C.Content_Title,
    C.Content_DatePublished,
    M.Media_Id
FROM 
    tbl_Contents C
    OUTER APPLY
    (
        SELECT TOP 1 *
        FROM tbl_Media M 
        WHERE M.Content_Id = C.Content_Id 
    ) m
ORDER BY 
    C.Content_DatePublished ASC

Alternatively, you could GROUP BY the results

SELECT 
    C.Content_ID,
    C.Content_Title,
    C.Content_DatePublished,
    M.Media_Id
FROM 
    tbl_Contents C
    LEFT OUTER JOIN tbl_Media M ON M.Content_Id = C.Content_Id 
GROUP BY
    C.Content_ID,
    C.Content_Title,
    C.Content_DatePublished,
    M.Media_Id
ORDER BY
    C.Content_DatePublished ASC

The OUTER APPLY selects a single row (or none) that matches each row from the left table.

The GROUP BY performs the entire join, but then collapses the final result rows on the provided columns.

edited Oct 25 '19 at 17:12

answered Mar 31 '14 at 18:45

eouw0o83hf

9,438
5
53
75

9

Using GROUP BY in this situation will result in rows being double counted if used in conjunction with aggregate functions like SUM or COUNT – nwhaught Sep 07 '16 at 15:53
Only if the aggregate functions do not appropriately account for that situation... – eouw0o83hf Sep 07 '16 at 17:25
2

I think you forgot to add an aggregate function to your select for M.Media_Id. That columns will not collapse by itself. In this form the query should give an error imho. – vargen_ Jul 04 '18 at 14:43
1

@vargen_ that is accurate. I edited the answer to reflect a correct aggregation – eouw0o83hf Jul 04 '18 at 23:44

score 27 · Answer 2 · answered Mar 31 '14 at 18:46

27

You can do this using generic SQL with group by:

SELECT C.Content_ID, C.Content_Title, MAX(M.Media_Id)
FROM tbl_Contents C LEFT JOIN
     tbl_Media M
     ON M.Content_Id = C.Content_Id 
GROUP BY C.Content_ID, C.Content_Title
ORDER BY MAX(C.Content_DatePublished) ASC;

Or with a correlated subquery:

SELECT C.Content_ID, C.Contt_Title,
       (SELECT M.Media_Id
        FROM tbl_Media M
        WHERE M.Content_Id = C.Content_Id
        ORDER BY M.MEDIA_ID DESC
        LIMIT 1
       ) as Media_Id
FROM tbl_Contents C 
ORDER BY C.Content_DatePublished ASC;

Of course, the syntax for limit 1 varies between databases. Could be top. Or rownum = 1. Or fetch first 1 rows. Or something like that.

answered Mar 31 '14 at 18:46

Gordon Linoff

1,242,037
58
646
786

As I see form OP the table `tbl_Media` hasn't column `Content_Id`. – Hamlet Hakobyan Mar 31 '14 at 18:50
Please check updated / corrected tbl_Media - now has Content_Id column, thanks – urooj.org Mar 31 '14 at 18:56
1

Wont the performance of the correlated sub query be really bad? I've come across this sort of query on big result sets and it was terrible. An additional query is being executed for every row matched. – mdon88 Feb 06 '15 at 10:30
1

@mdon88 . . . The performance should be fine with an index on `tbl_Media(ContentId, Media_id)`. – Gordon Linoff Feb 06 '15 at 13:06

score 11 · Answer 3 · answered Sep 05 '18 at 23:43

11

Using the DISTINCT flag will remove duplicate rows.

SELECT DISTINCT
C.Content_ID,
C.Content_Title,
M.Media_Id

FROM tbl_Contents C
LEFT JOIN tbl_Media M ON M.Content_Id = C.Content_Id 
ORDER BY C.Content_DatePublished ASC

answered Sep 05 '18 at 23:43

Victor Moxley

119
1
2

16

adding `DISTINCT` to queries can be highly inefficient (in postgress with 5000+ rows, I've seen a 10 fold increase in the query time) – Alexander Kleinhans Feb 13 '19 at 10:11

score 1 · Answer 4 · answered Feb 01 '22 at 19:25

The table tbl_media has the content_id 10012 appearing two times, so when tbl_content joins with tbl_media, it will capture 10012 twice, creating one row with media_id = 1003 and another one with media_id = 1000 (so if you consider both content_id and media_id, the information is actually not duplicated).

The question becomes, what is the correct media to prioritize for content 10012 (1000 or 1003)? Once you determine this relationship, you can adjust the IDs tbl_media and join the tables without creating duplicates on media_id level.

Left Join without duplicate rows from left table

4 Answers4

Linked