Why do multiple-table joins produce duplicate rows?

Question

Let's say I have three tables A, B, and C. Each has two columns: a primary key and some other piece of data. They each have the same number of rows. If I JOIN A and B on the primary key, I should end up with the same number of rows as are in either of them (as opposed to A.rows * B.rows).

Now, if I JOIN A JOIN B with C, why do I end up with duplicate rows? I have run into this problem on several occasions and I do not understand it. It seems like it should produce the same result as JOINing A and B since it has the same number of rows but, instead, duplicates are produced.

Queries that produce results like this are of the format

SELECT *
FROM M
    INNER JOIN S
        on M.mIndex = S.mIndex
    INNER JOIN D
        ON M.platformId LIKE '%' + D.version + '%'
    INNER JOIN H
        ON D.Name = H.Name
        AND D.revision = H.revision

Here are schemas for the tables. H contains is a historic table containing everything that was ever in D. There are many M rows for each D and one S for each M.

Table M

    [mIndex] [int] NOT NULL PRIMARY KEY,
    [platformId] [nvarchar](256) NULL,
    [ip] [nvarchar](64) NULL,
    [complete] [bit] NOT NULL,
    [date] [datetime] NOT NULL,
    [DeployId] [int] NOT NULL PRIMARY KEY REFERENCES D.DeployId,
    [source] [nvarchar](64) NOT NULL PRIMARY KEY

Table S

[order] [int] NOT NULL PRIMARY KEY,
[name] [nvarchar](64) NOT NULL,
[parameters] [nvarchar](256) NOT NULL,
[Finished] [bit] NOT NULL,
[mIndex] [int] NOT NULL PRIMARY KEY,
[mDeployId] [int] NOT NULL PRIMARY KEY,
[Date] [datetime] NULL,
[status] [nvarchar](10) NULL,
[output] [nvarchar](max) NULL,
[config] [nvarchar](64) NOT NULL PRIMARY KEY

Table D

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[branch] [nvarchar](64) NOT NULL,
[revision] [int] NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](256) NOT NULL

Table H

[IdDeploy] [int] IDENTITY(1,1) NOT NULL,
[name] [nvarchar](64) NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](max) NOT NULL,
[StartDate] [datetime] NOT NULL,
[EndDate] [datetime] NULL,
[Revision] [nvarchar](64) NULL,

I didn't post the tables and query initially because I am more interested in understanding this problem for myself and avoiding it in the future.

You may be making a Cartesian join. The wiki page has examples of this and other types of joins - http://en.wikipedia.org/wiki/Join_(SQL)#Cross_join — Joseph B, May 21 '14 at 14:42
@sirdank That's not a useful example without an additional sample of the data you're querying. — Bob Tway, May 21 '14 at 14:44
@JosephB I don't believe that's the case. Please see my revised question. — sirdank, May 21 '14 at 14:45
@MattThrower Please forgive me, I do not understand. What can I do (aside from posting the contents of my database, which I cannot) to help you? — sirdank, May 21 '14 at 14:48
@sirdank You post a small selection of the data. Or, at the very least, the table definitions. But to be honest your question is really about the mechanics of how joins work rather than a specific problem. — Bob Tway, May 21 '14 at 14:50
@MattThrower That is on purpose. I am intentionally asking about the mechanics of joins in order to understand them better. I want to learn to fish instead of ask SO for a fish whenever I get hungry. I'll post the table definitions and update the sample query. — sirdank, May 21 '14 at 14:51
This is a strange example. You want all tables's keys to match? Provided they really are the tables' IDs as their names suggest, they would be unique and you couldn't posiibly get any duplicates. I think you have given a wrong example. — Thorsten Kettner, May 21 '14 at 14:58
If one of the tables M, S, D, or H has more than one row for a given Id (if just the Id column is not the PK), then the query would result in "duplicate" rows. If you have more than one row for an Id in a table, then the other columns, which would uniquely identify a row, also must be included in the JOIN condition(s). — Joseph B, May 21 '14 at 14:58
@JosephB You may have solved my problem. I think the answer is that I need to use all keys in the join condition. I didn't realize that was necessary or that I wasn't doing it. It makes sense though. If you post that as an answer, I'll accept it. — sirdank, May 21 '14 at 15:33
@sirdank Yes, please use all the key columns in your join condition to avoid "duplicate" rows. I have included my comment as an answer below. — Joseph B, May 21 '14 at 16:29

score 81 · Answer 1 · edited Jun 22 '17 at 11:58

When you have related tables you often have one-to-many or many-to-many relationships. So when you join to TableB each record in TableA many have multiple records in TableB. This is normal and expected.

Now at times you only need certain columns and those are all the same for all the records, then you would need to do some sort of group by or distinct to remove the duplicates. Let's look at an example:

TableA
Id Field1
1  test
2  another test

TableB
ID Field2 field3
1  Test1  something
1  test1  More something
2  Test2  Anything

So when you join them and select all the files you get:

select * 
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1        b.id   b.field2  b.field3
1    test            1      Test1     something
1    test            1      Test1     More something
2    another test 2  2      Test2     Anything

These are not duplicates because the values of Field3 are different even though there are repeated values in the earlier fields. Now when you only select certain columns the same number of records are being joined together but since the columns with the different information is not being displayed they look like duplicates.

select a.Id, a.Field1,  b.field2
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1       b.field2  
1    test           Test1     
1    test           Test1 
2    another test   Test2

This appears to be duplicates but it is not because of the multiple records in TableB.

You normally fix this by using aggregates and group by, by using distinct or by filtering in the where clause to remove duplicates. How you solve this depends on exactly what your business rule is and how your database is designed and what kind of data is in there.

I was missing a "Group By" argument & this helped me. Thanks for explaining — Derek, Nov 02 '21 at 12:45

score 49 · Accepted Answer · answered May 21 '14 at 16:26

If one of the tables M, S, D, or H has more than one row for a given Id (if just the Id column is not the Primary Key), then the query would result in "duplicate" rows. If you have more than one row for an Id in a table, then the other columns, which would uniquely identify a row, also must be included in the JOIN condition(s).

References:

Why do multiple-table joins produce duplicate rows?

6 Answers6

Linked