SQL inner join vs subquery

Question

I am working on following queries:

Query 1: SELECT * From TabA INNER JOIN TabB on TabA.Id=TabB.Id
Query 2: SELECT * From TabA WHERE Id in (SELECT Id FROM TabB)
Query 3: SELECT TabA.* From TabA INNER JOIN TabB on TabA.Id=TabB.Id

I investigate these queries with SQL Server profiler and found some interesting facts.

Query 1 takes 2.312 Seconds
Query 2 takes 0.811 Seconds
Query 3 takes 0.944 Seconds

TabA 48716 rows

TabB 62719 rows

Basically what I am asking is why Query 1 is taking long time, not Query 3. I already know that 'sub query' is slower than inner join but here Query 2 is fastest; why?

Did you run these timings multiple times? Query performance is greatly affected by whether or not the tables are loaded into the page cache. — Gordon Linoff, May 02 '13 at 14:03
be careful with cache. Are you cleaning it after every query? — Gonzalo.-, May 02 '13 at 14:03
Query 1 is slower then Query 3 because you do not load the data from TabB in Query 3 — AgeDeO, May 02 '13 at 14:04
For fun, you should test it with a `CROSS APPLY` too: `SELECT * From TabA a CROSS APPLY (SELECT * FROM TabB WHERE Id = a.Id) b` — valverij, May 02 '13 at 14:04
Query 1 have all the columns from Table A and B but Query 3 only the column of Table A. That's why the query 3 is faster than query 1 — Deepak.Aggrawal, May 02 '13 at 14:05
Query A is returning all columns from TabA and TabB. Queries 2 and 3 return columns from only TabB. Number of columns in resultset, and type / size of each may have something to do with it. How are you determining your profiler times? Are they "total roundtrip" times? If so, the total roundtrip time would be greater for query 1 since more data needs to be passed over the wire. — Shan Plourde, May 02 '13 at 14:07
possible duplicate [here](http://stackoverflow.com/questions/4799820/when-to-use-sql-sub-queries-versus-a-standard-join) — Shafqat Masood, May 02 '13 at 14:07
Here is reads from sql server: Query 1: 231240 reads, Query 2: 1586 reads, Query 3: 1594 reads. — Mohsin JK, May 02 '13 at 14:16
Assuming that a read is required to transmit data over the wire, TabB probably contains a large number of columns, with several set to large data types. Is that so? Only Query A requires all columns and values to be included in the result set and transmitted over the wire. What do the performance times in seconds represent? Total client roundtrip time? — Shan Plourde, May 02 '13 at 14:49
Are you running each query 4 or 5 times and averaging the results? The difference could be due to caching or just server load at the time each was run. Also, you do understand that SQL doesn't run the query you typed but translates it to code, which may very well be identical for queries with identical results, so your subquery is slower theory isn't that solid. — JohnFx, May 03 '13 at 05:12
You can try `SELECT * From TabA WHERE EXISTS (SELECT Id FROM TabB WHERE TabA.Id=TabB.Id)` — Justin, May 03 '13 at 05:37
It's probably just caching. Try clearing the cache then running that in the reverse order, or clearing the cache each time. — Nick.Mc, May 03 '13 at 06:19
@Nick.McDermaid It's also possible to avoid the cache by using a keyword on some DBMS. In MySQL for example this is done by adding SQL_NO_CACHE after SELECT. — Alexander Behling, Mar 05 '21 at 10:54

Kenneth Fisher · Accepted Answer · 2013-05-03T03:23:48.283

If I had to guess I would say it's because query 1 is pulling the data from both tables. Queries 2 and 3 (aprox the same time) are only pulling data for TabA.

One way you could check this is by running the following:

SET STATISTICS TIME ON
SET STATISTICS IO ON

When I ran

SELECT * FROM sys.objects

I saw the following results.

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 104 ms.

(242 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syssingleobjrefs'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syspalnames'. Scan count 1, logical reads 2, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 866 ms.

You can take a look at # of scans, logical reads and physical reads for each query. Physical reads of course take much longer and represent reading from the disk into the cache. If all of your reads are logical reads then your table is completely in cache.

I would be willing to bet if you look you will see a lot more logical reads on TabB on query 1 than on 2 and 3.

EDIT:

Just out of curiosity I did some tests and blogged the results here.

score 2 · Answer 2 · answered May 02 '13 at 14:43

Query 1:
This query is returning rows from all rows in both TabA and TabB so the covering index for both tables requires all rows from each table to be included. To see exactly what's going on you'd want to look at the query plan.

Query 2 & Query 3:
You're returning all rows from TabA and you only need the index for the Id column for TabB. I'm guessing the difference here has something to do with the table statistics, but (once again) we'd have to see the query plan to know exactly what's going on.

score 1 · Answer 3 · answered May 02 '13 at 14:04

1

It's simply because SQL doesn't have to perform a JOIN. You are just performing two queries, and only one of them has a WHERE clause.

I must admit I didn't expect SUCH a big difference.

answered May 02 '13 at 14:04

Captain Kenpachi

6,960
7
47
68

Nice! but sub queries are supposed to be slower than inner join queries. [Reference](http://stackoverflow.com/questions/141278/subqueries-vs-joins) – Mohsin JK May 02 '13 at 14:27
That's what the textbook says, yes, but it makes more sense that the less happening behind the scenes, the quicker you get your results. I wonder what would happen if you created a view from your JOIN query and performance tested that. By the way, Views or Stored Procedures are much faster than regular JOIN queries because the execution plan is stored. – Captain Kenpachi May 02 '13 at 14:44

juanora · Answer 4 · 2017-12-27T15:42:08.233

0

If the join is one to many, is possible that what is taking time is the repeated data. Instead, you can format a set of related rows as JSON array. Check the "Use Case 1" on https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/10/09/returning-child-rows-formatted-as-json-in-sql-server-queries/

edited Dec 27 '17 at 15:42

answered Dec 27 '17 at 15:14

juanora

542
10
21

SQL inner join vs subquery

4 Answers4