4

I have a large historical transaction table (15-20 million rows MANY columns) and a table with one row one column. The table with one row contains a date (last processing date) which will be used to pull the data in the trasaction table ('process_date').

Question: Should I inner join the 'process_date' table to the transaction table or the transaction table to the 'process_date' table?

Keng
  • 52,011
  • 32
  • 81
  • 111
  • 2
    I would be very surprised if it would result in different execution plans. – Lieven Keersmaekers May 21 '10 at 14:23
  • 2
    Just checked with two of our tables. It returns the exact same execution plan. To answer your question, it doesn't matter. – Lieven Keersmaekers May 21 '10 at 14:25
  • Does the execution plan have a direct correlation to efficientcy? That is if 2 plans are exactly the same is the execution time exactly the same? – Keng May 21 '10 at 14:48
  • 1
    If there's really only ever going to be one row/column in the table then I probably wouldn't `JOIN` at all, I'd `SELECT` the value into a variable and then write `WHERE process_date = @last_process_date`. – Aaronaught May 21 '10 at 15:19

4 Answers4

6

This is how I would do it

SELECT <<list only columns you need>> 
FROM large_historical_transaction_table t
WHERE EXISTS (SELECT 1 FROM OneRowTable o 
              WHERE o.last_processing_date = t.process_date)
SQLMenace
  • 132,095
  • 25
  • 206
  • 225
  • I'm sure you know what you're talking about, with a name like SQLMenace and a rep of 17k, but on first glance, it looks like correlated subquery will be executed for each row in the large table? – mdma May 21 '10 at 14:27
  • +1, @mdma, it looks that way, but this is faster than the regular join – KM. May 21 '10 at 14:33
  • Thanks for clarification. I'm interested to get into the details of how it's faster - do you know of links describing this in more detail? – mdma May 21 '10 at 15:02
  • 1
    @mdma: Generally speaking, a `JOIN` implies returning **all** matching rows whereas `EXISTS` only needs to find the first one. Although in this case it doesn't matter because there's only one row in the lookup table to begin with. – Aaronaught May 21 '10 at 15:15
  • +1, even though I don't get it! @Aaronaught - I kind of see where you are coming from, but here there is only one row, so what's the actual difference in practice? – mdma May 21 '10 at 15:17
  • I just tried this in SQL server. Using an INNER JOIN and the subselect - the actual query plans are identical, so the subquery wasn't faster. – mdma May 21 '10 at 15:22
  • 2
    @mdma: With only one row, there probably is no difference in practice. Still, it's good to get into the habit of using `IN` or `EXISTS` when you don't actually need a true `JOIN` - oftentimes (but not always) it will lead to a more efficient plan. Especially on, shall we say, *differently-abled* DBMSes like mysql. – Aaronaught May 21 '10 at 15:24
  • Differently-abled! hehe! :-) Thanks for the advice, I'll keep that in mind. I kind of have a wired in knee jerk reaction when relating two tables to use a join, I'll have to unlearn that habit. – mdma May 21 '10 at 15:48
3

An Inner join is a symmetrical, bi-directional relationship, in general it doesn't matter, but in this case I would suggest not joining at all, read the threshold date into a variable and pass it to the other select query as a parameter...

Charles Bretana
  • 143,358
  • 22
  • 150
  • 216
2

For readability I would inner join from the transaction table to explicitly indicate that the second table with the date just acts like a filter.

Raj
  • 1,742
  • 1
  • 12
  • 17
2

When joining tables, the query optimizer takes a quick sniff of both to determine the most appropriate join implementation. Logically the inner join is symmetric, but the implementation may favor one side over the other for improved performance.

mdma
  • 56,943
  • 12
  • 94
  • 128