
I want to know: if I have a join query something like this -

SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id

and a subquery something like this -

SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept)

When I consider performance, which of the two queries would be faster, and why?

Also, is there a time when I should prefer one over the other?

Sorry if this is too trivial or has been asked before, but I am confused about it. Also, it would be great if you could suggest tools I should use to measure the performance of the two queries. Thanks a lot!

Mohammad Dehghan
Vishal
  • @Lucero, this question is tagged sql-server-2008, whereas the post you mention is tagged MySql. You can't infer that the answers will be the same; performance optimisation is done differently on the two RDBMSs. – Francois Botha Apr 25 '12 at 15:35

8 Answers

59

Well, I believe it's an "Old but Gold" question. The answer is: "It depends!". Performance is such a delicate subject that it would be silly to say "never use subqueries, always join". In the following links you'll find some basic best practices that I have found very helpful:

I have a table with 50,000 rows; the result I was looking for was 739 rows.

My query at first was this:

SELECT  p.id,
    p.fixedId,
    p.azienda_id,
    p.categoria_id,
    p.linea,
    p.tipo,
    p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND p.anno = (
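    -- correlated scalar subquery: it references p.fixedId, so it is
    -- conceptually re-evaluated for every candidate row of the outer query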
    SELECT MAX(p2.anno) 
    FROM prodotto p2 
    WHERE p2.fixedId = p.fixedId 
)

and it took 7.9s to execute.

My query at last is this:

SELECT  p.id,
    p.fixedId,
    p.azienda_id,
    p.categoria_id,
    p.linea,
    p.tipo,
    p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND (p.fixedId, p.anno) IN
(
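    -- builds one (fixedId, MAX(anno)) pair per fixedId for the matching azienda;
    -- the outer row's (fixedId, anno) pair is then simply checked against that set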
    SELECT p2.fixedId, MAX(p2.anno)
    FROM prodotto p2
    WHERE p.azienda_id = p2.azienda_id
    GROUP BY p2.fixedId
)

and it took 0.0256s to execute.

Good SQL, good.

Manuel Jordan
linuxatico
  • Interesting, could you explain how adding the GROUP BY fixed it? – cozos Nov 08 '17 at 23:27
  • The temporary table generated by the subquery was smaller, so there is less data to check against and execution is quicker. – Sirmyself May 23 '18 at 12:30
  • I think that in the first query you have a variable shared between the outer query and the subquery, so the subquery executes for every row of the main query; in the second one the subquery effectively executes only once, which is why performance improved. – Ali Faradjpour Apr 05 '19 at 10:42
  • SQL Server, MySQL and the other relational engines (NoSQL excepted) are quite similar in infrastructure: a query optimization engine underneath converts IN (...) clauses to joins where possible. But when you GROUP BY a well-indexed column (given its cardinality), it can be much faster. So it really depends on the situation. – AbbasAli Hashemian May 17 '20 at 14:22
  • Are you sure the buffer was clean? If you ran both queries one after the other, that alone could produce a massive difference in performance. – Yuval Perelman Feb 01 '21 at 15:55
57

I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).

As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.

The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
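
As a minimal sketch of that approach in SQL Server (run it on a test box only, since the DBCC commands flush server-wide caches), the following prints logical reads and CPU/elapsed times for whichever query you place between the SET statements:

CHECKPOINT;                -- write dirty pages so the buffer cache can be dropped cleanly
DBCC DROPCLEANBUFFERS;     -- start from a cold buffer cache
DBCC FREEPROCCACHE;        -- discard cached execution plans

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT E.Id, E.Name
FROM Employee E
JOIN Dept D ON E.DeptId = D.Id;    -- run again with the IN version to compare

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;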

JNK
  • I have serious doubts about this answer, since most DBMSs, definitely SQL Server 2008 and later, translate the single-ID subquery (not correlated, i.e. not referencing outer-query columns) into a relatively fast semi-join. Also, as noted in another answer, the first, real join will return a row for EACH occurrence of the matching ID in Dept; this makes no difference for a unique ID, but will give you tons of duplicates elsewhere. Sorting these out with DISTINCT or GROUP BY will be another heavy performance load. Check execution plans in SQL Server Management Studio! – Erik Hart Dec 27 '13 at 09:32
  • The IN clause as an equivalent to OR applies to parameter/value lists, but not to subqueries, which are mostly treated like joins. – Erik Hart Dec 27 '13 at 09:55
18

Performance depends on the amount of data you are executing against...

If there is less data, around 20k rows, JOIN works better.

If the data is more like 100k+ rows, then IN works better.

If you do not need the data from the other table, IN is good, but it is always better to go for EXISTS.

I tested all of these cases, and the tables had proper indexes.
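
For reference, a sketch of the EXISTS form using the question's table names; whether it actually beats the join or the IN version still has to be measured on your own data:

SELECT E.Id, E.Name
FROM Employee E
WHERE EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId);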

Hardik Mishra
JP Emvia
11

Start by looking at the execution plans to see the differences in how SQL Server will interpret them. You can also use Profiler to actually run the queries multiple times and get the difference.

I would not expect these to be horribly different; where you can get real, large performance gains from using joins instead of subqueries is when you use correlated subqueries.

EXISTS is often better than either of these two, and when you are talking about left joins where you want all the records not in the left-joined table, NOT EXISTS is often a much better choice.
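
A sketch of the two usual spellings of that "rows with no match" case, again using the question's table names (which form wins depends on the optimizer and indexes):

-- LEFT JOIN version: keep only employees whose DeptId has no matching Dept row
SELECT E.Id, E.Name
FROM Employee E
LEFT JOIN Dept D ON D.Id = E.DeptId
WHERE D.Id IS NULL;

-- NOT EXISTS version: often the better plan, and unlike NOT IN it behaves sanely with NULLs
SELECT E.Id, E.Name
FROM Employee E
WHERE NOT EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId);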

HLGEM
6

The performance should be the same; it's much more important to have the correct indexes and clustering applied on your tables (there exist some good resources on that topic).
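
As a sketch of the kind of supporting index meant here, assuming Id is already the clustered primary key on both tables (an assumption, not something stated in the question):

-- A nonclustered index on the foreign key lets both the join and the IN/EXISTS
-- forms seek on DeptId instead of scanning Employee; INCLUDE (Name) covers the
-- SELECT list so no key lookups are needed.
CREATE NONCLUSTERED INDEX IX_Employee_DeptId
ON Employee (DeptId)
INCLUDE (Name);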

(Edited to reflect the updated question)

Lucero
6

I know this is an old post, but I think this is a very important topic, especially nowadays when we have 10M+ records and talk about terabytes of data.

I will also weigh in with the following observations. I have about 45M records in my table ([data]) and about 300 records in my [cats] table, and I have extensive indexing for all of the queries I am about to discuss.

Consider Example 1:

UPDATE d set category = c.categoryname
FROM [data] d
JOIN [cats] c on c.id = d.catid

versus Example 2:

UPDATE d set category = (SELECT TOP(1) c.categoryname FROM [cats] c where c.id = d.catid)
FROM [data] d

Example 1 took about 23 mins to run. Example 2 took around 5 mins.

So I would conclude that the sub-query in this case is much faster. Of course, keep in mind that I am using M.2 SSD drives capable of I/O at 1 GB/sec (that's bytes, not bits), so my indexes are really fast too. This may affect the speeds in your circumstances as well.

If it's a one-off data cleansing job, it's probably best to just let it run and finish. I use TOP(10000) and see how long it takes, then multiply by the number of records before I run the big query.
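
A sketch of that sampling trick with the answer's table names; TOP picks an arbitrary batch of rows, which is fine for a timing estimate:

-- Time a bounded batch, then extrapolate to the full row count
-- before committing to the big update.
UPDATE TOP (10000) d
SET d.category = c.categoryname
FROM [data] d
JOIN [cats] c ON c.id = d.catid;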

If you are optimizing production databases, I would strongly suggest pre-processing the data, i.e. using triggers or a job broker to update records asynchronously, so that real-time access retrieves static data.
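
As a hypothetical sketch of that pre-processing idea: the trigger name is made up, [data], [cats], catid and categoryname come from the answer, and an id key on [data] is assumed.

CREATE TRIGGER trg_data_sync_category
ON [data]
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- keep the denormalized category column in step at write time,
    -- so real-time reads never need the join or subquery
    UPDATE d
    SET d.category = c.categoryname
    FROM [data] d
    JOIN inserted i ON i.id = d.id
    JOIN [cats] c ON c.id = i.catid;
END;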

Vinnie Amir
5

The two queries may not be semantically equivalent. If an employee works for more than one department (possible in the enterprise I work for; admittedly, this would imply your table is not fully normalized), then the first query would return duplicate rows, whereas the second query would not. To make the queries equivalent in this case, the DISTINCT keyword would have to be added to the SELECT clause, which may have an impact on performance.
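
A sketch of the DISTINCT remedy just described, with the question's table names:

SELECT DISTINCT E.Id, E.Name
FROM Employee E
JOIN Dept D ON E.DeptId = D.Id;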

Note there is a design rule of thumb that states a table should model an entity/class or a relationship between entities/classes but not both. Therefore, I suggest you create a third table, say OrgChart, to model the relationship between employees and departments.
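
A hypothetical sketch of that junction table; the column names are assumed, as is Id being the primary key of Employee and Dept:

-- Models the employee/department relationship separately from the entities themselves.
CREATE TABLE OrgChart
(
    EmployeeId INT NOT NULL REFERENCES Employee (Id),
    DeptId     INT NOT NULL REFERENCES Dept (Id),
    CONSTRAINT PK_OrgChart PRIMARY KEY (EmployeeId, DeptId)
);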

onedaywhen
1

You can use an Explain Plan to get an objective answer.

For your problem, an Exists filter would probably perform the fastest.

Snekse
  • "An Exists filter would probably perform the fastest" - probably not, I think, although a definitive answer would require testing against the actual data. Exists filters are likely to be faster where there are multiple rows with the same lookup values - so an exists filter might run faster if the query was checking whether other employees had been recorded from the same department, but probably not when looking up against a department table. – Oct 04 '10 at 15:05
  • Would it run slower in that last scenario? – Snekse Oct 04 '10 at 17:08
  • It would depend on the optimiser - under certain circumstances it might, but normally I would expect very similar performance. – Oct 05 '10 at 12:47