1

I have 2 tables with several identical fields needed to be linked in JOIN condition. E.g. in each table there are fields: P1, P2. I want to write the following join query:

SELECT ... FROM Table1
   INNER JOIN
   Table2
      ON    Table1.P1 = Table2.P1
         OR Table1.P2 = Table2.P2
         OR Table1.P1 = Table2.P2
         OR Table1.P2 = Table2.P1

In the case I have huge tables this request is executing a lot of time.

I tried to test how long will be the request of a query with one condition only. First, I have modified the tables in such way all data from P2 & P1 where copied as new rows into Table1 & Table2. So my query is simple:

SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P

The result was more then surprised: the execution time from many hours (the 1st case) was reduced to 2-3 seconds!

Why is it so different? Does it mean the complex conditions are always reduce performance? How can I improve the issue? May be P1,P2 indexing will help? I want to remain the 1st DB schema and not to move to one field P.

Ben
  • 51,770
  • 36
  • 127
  • 149
RedSoft
  • 373
  • 2
  • 15
  • Are those `OR`s in join condition supposed to be `AND` or is that intentional? – Vikdor Sep 09 '12 at 11:13
  • By the looks of it, it is intentional. The match is where Table1's P1 or P2 equals table2's P1 or P2 – DarK Sep 09 '12 at 11:17
  • 4
    possible duplicate of [Is having an 'OR' in an INNER JOIN condition a bad idea?](http://stackoverflow.com/questions/5901791/is-having-an-or-in-an-inner-join-condition-a-bad-idea) – Martin Smith Sep 09 '12 at 11:25

3 Answers3

5

The reason the queries are different is because of the join strategies being used by the optimizer. There are basically four ways that two tables can be joined:

  1. "Hash join": Creates a hash table on one of the tables which it uses to look up the values in the second.
  2. "Merge join": Sorts both tables on the key and then readsthe results sequentially for the join.
  3. "Index lookup": Uses an index to look up values in one table.
  4. "Nested Loop": Compars each value in each table to all the values in the other table.

(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.

In any case, your original join is one of the first three -- and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.

When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.

Without testing, I might suggest some of the following strategies.

  1. Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
  2. Use the union query suggested in another solution. Each query should be correctly optimized.
  3. Assuming these are 1-1 joins, you can also do this as a set of multiple joins:

    from table1 t1 left outer join table2 t2_11 on t1.p1 = t2_11.p1 left outer join table2 t2_12 on t1.p1 = t2_12.p2 left outer join table2 t2_21 on t1.p2 = t2_21.p2 left outer join table2 t2_22 on t1.p2 = t2_22.p2

And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

you can use 4 query and Union there result

SELECT ... FROM Table1
INNER JOIN
Table2
  ON    Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON   Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON    Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON   Table1.P2 = Table2.P2
Kaushal
  • 471
  • 3
  • 15
  • This will get expanded out to the same query as the OP. It will still end up with a nested loops join. – Martin Smith Sep 09 '12 at 11:34
  • 1
    @Kaushal, 10x! The second advice is fast, so I will use it. But still don't understand whether this is only MS SQL or common problem? – RedSoft Sep 09 '12 at 13:38
  • +1 for the union solution, some rdbms will convert multiple OR criteria into unions automatically depending on other CBO variables – gordy Sep 09 '12 at 19:07
  • @Sergey -this is the issue of multiple 'OR'. also if that helps you, please don not forget to accept the answer, so that it will help others also. – Kaushal Sep 10 '12 at 03:58
  • @Kaushal - how do I accept the answer? Is it voting up? - It says I cannot do it because my reputation is low. Is it answering to "Was this post useful to you?" - Yes, I have answered. May be something else? – RedSoft Sep 11 '12 at 14:44
0

Does using CTEs help performance?

;WITH Table1_cte 
AS
(
SELECT 
      ...
      [P] = P1
FROM Table1
UNION   
SELECT 
      ...
      [P] = P2
FROM Table1
)
, Table2_cte 
AS
(
SELECT 
      ...
      [P] = P1
FROM Table2
UNION   
SELECT 
      ...
      [P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
   INNER JOIN
   Table2_cte y
      ON x.P = y.P

I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.

whytheq
  • 34,466
  • 65
  • 172
  • 267