-1

I am trying to do some left joins on multiple tables and facing the following issue.

Row Counts of tables

Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M

When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):

SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
    AND Table1.ID2 = Table2.ID2
    AND Table1.Source_System = Table2.Source_System
;

Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):

SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City 
FROM Table1
LEFT JOIN Table3
    ON Table1.ID1 = Table3.ID1 
    AND Table1.ID2 = Table3.ID2
    AND Table1.Source_System = Table3.Source_System
;

But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):

SELECT 
    Table1.ID1, Table1.ID2,
    Table2.Name, Table2.City,
    Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
    AND Table1.ID2 = Table2.ID2
    AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
    ON Table1.ID1 = Table3.ID1
    AND Table1.ID2 = Table3.ID2
    AND Table1.Source_System = Table3.Source_System
;
Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
Ishan Juneja
  • 401
  • 4
  • 11
  • Please in code questions give a [mre]--cut & paste & runnable code, including smallest representative example input as code; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. Give the least code you can that is code that you show is OK extended by code that you show is not OK. (Debugging fundamental.) For SQL that includes DBMS & DDL (including constraints & indexes) & input as code formatted as a table. [ask] Pause work on the overall goal, chop code to the 1st expression not giving what you expect & say what you expect & why. – philipxy Sep 30 '20 at 19:43
  • 2
    If you get more rows than you expect, then there probably is a problem in the join conditions. You would need to provide a minimal reproducible example that demonstrates the problem (otherwise the question is too broad, and can't be answered accurately): sample data, current and desired results as tabular text. – GMB Sep 30 '20 at 20:45
  • They are different because that is how the language is defined. Say why you think they are the same. If you can't then you have no reason to think that you have an opinion on whether they are the same so why do you? And you are just asking for yet another presentation of the language to misunderstand while not telling us what your reasoning is for us to address your misconceptions. (You will also probably find your own misconceptions while finding your first unexpected subexpression result while constructing your *obligatory*.) – philipxy Oct 03 '20 at 05:31
  • Does this answer your question? [Two SQL LEFT JOINS produce incorrect result](https://stackoverflow.com/questions/12464037/two-sql-left-joins-produce-incorrect-result) – philipxy Nov 29 '22 at 05:44

4 Answers4

1

So it seems you have assumed the data in table1 and table2 join in a 1:1 ratio, and also assumed the table1 and table3 are also a 1:1 ratio, so assumed when those three tables joined, that ration should be in the order again of 1:1

But if half you entries in table1 are not in table2 to get the 1.8M result, the the common rows would have to be duplicated > 2.0 times that increase. If we change that from half not matching to a tenth not matching there would need to be > 10.0 duplicates. Thus to get the 4 magnitude growth you have, it seems like you have only 100th match, but greater than 100.0 duplicates, which when cross joined give the 10,000 growth in rows.

this could be seen via:

SELECT Table1.ID1, Table1.ID2, Table1.Source_System, counnt(*) as counts
FROM Table1
LEFT JOIN Table2
    ON Table1.ID1 = Table2.ID1
        AND Table1.ID2 = Table2.ID2
        AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3 
ORDER BY counts DESC
;

this will show the total distinct pairs, and which are the worst contributors to the combination explosion

Simeon Pilgrim
  • 22,906
  • 3
  • 32
  • 45
0

When your left join is producing more records than the referenced table it should not be acceptable! that should signal warning in your join condition and data. Either you investigate those records in the table to avoid it in the first place or you would need to keep tweaking your SQL to satisfy clean join that produces exact reference table row count. otherwise, it is very common that left joining to another table with a small duplicate records will produce exponential row count as you are facing here.

Try reading these questions here to help here and here

Just to add about investigating and finding those rows, use following SQL to find in each table what rows that have same ID1, ID2 and Source_System columns

i.e. :-

Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System 
HAVING COUNT(*)>1  -- Filtering on duplicate rows that has more than a row satisfying the join condition

Use the same for each of the tables to find those records and either add another unique condition/ aggregate the table on the joining keys or ask for data cleansing ! for those records

Monem_منعم
  • 336
  • 2
  • 10
  • A join that is not FK-PK is not necessarily a Cartesian product & is perfectly normal. A join on any condition that is not always true is not a Cartesian product anyway. Joins are not on keys, they are on conditions, and a Cartesian product is when the condition is always true, although the term is also abused to mean there is not a condition with a conjunct that involves columns from both tables. Most importantly, this all has nothing to do with the problem in the post. See my comments on the post. – philipxy Sep 30 '20 at 23:37
  • Okay, I think you are pretty precise in defining terms, and I may have abused the cartesian term as you said but I believe the explanation of the issue which starts with a left join producing more records than expected and how to investigate the issue is relevant to the post and pinpointing what needs to be done for fixing it in SQL or data – Monem_منعم Oct 01 '20 at 00:49
  • @philipxy I have edited the post and removed any cartesian reference, what do you think? does it read well now? – Monem_منعم Oct 01 '20 at 00:55
  • Your 1st sentence & your focus are just wrong. The issue is not anything about whether any join is 1:1-or-0, 1:M or M:M. It is about aggregating over wrong join(s) and making wrong join(s). When the wrong join happens to be M:M & the wrong one happens to not be, we might notice that the wrong one happens to have undesired multiple subrows. That's just sometimes a property of a wrong join. It is misconceived to pick one property of wrongness & say that's the problem. PS Constraints, including PKs, FKs or cardinalities, don't need to be known or hold in order to query. PS Cartesian links remain. – philipxy Oct 01 '20 at 03:01
0

Have you tried adding a DISTINCT clause?

SELECT DISTINCT columns, of, choice FROM Table1 LEFT JOIN Table2 on ... LEFT JOIN Table3 on ...

I think what's happening is you have dups that left join on another giant set of dups.

-2

Use the proper keys to join the two tables, it solves the issue.