
I originally wrote the following query:

SELECT t1.TransactionNumber
FROM t1
    JOIN
    (
          SELECT MAX(id) id
          FROM t1
          WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
          GROUP BY AccountNumber
    ) t2
        on t1.id = t2.id

But it was too slow; it took around 20 seconds. So, as a test, I changed it to the following:

 SELECT MAX(id) AS id
 INTO #t2
 FROM t1
 WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
 GROUP BY AccountNumber

 SELECT t1.id 
 FROM t1
    JOIN #t2 t2
        ON t1.id = t2.id

The second query took only 1 second to run. The second query does an index seek using the primary key, whereas the first query does a scan.

Note: id is the clustered primary key on table t1.
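
To see where the time goes, the two runs can be compared with SQL Server's standard statistics switches (a sketch using the first query; the second would be measured the same way):

SET STATISTICS IO ON;   -- per-table logical reads
SET STATISTICS TIME ON; -- CPU and elapsed time per statement

SELECT t1.TransactionNumber
FROM t1
    JOIN
    (
          SELECT MAX(id) AS id
          FROM t1
          WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
          GROUP BY AccountNumber
    ) t2
        on t1.id = t2.id;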

user172839
  • Do you have execution plans for both queries? – slavoo Feb 21 '14 at 06:10
  • Did you mean this to be a self join on the primary key column of the table `t1` or are there other tables involved? – Mikael Eriksson Feb 21 '14 at 06:38
  • There are other tables, but they shouldn't be relevant. In the first query, if I run just the subquery by itself, it takes about 1 second. But if I run the whole query, it takes over 20 seconds. I would have thought it should take no more than 2 seconds, since it's joining on the primary key. The subquery returns fewer than 2,000 rows, too. – user172839 Feb 23 '14 at 23:05
  • Please be aware the other tables might be relevant. I have experienced this situation many times: a big query with subqueries runs very slowly, but once you separate the subqueries out, as you did, it runs much faster. The execution plan will tell you a lot; SET STATISTICS IO ON will also be helpful. – John Jin Jun 29 '14 at 23:23

7 Answers


This is a guess, but it could be because the statistics on the primary key are not working for you. If the query optimizer thinks you are only going to return 10 records from your inner join, but you instead return 100, it overflows the memory buffer and ends up having to write the results of the subquery to disk. If you post your query execution plan results, it should become pretty obvious.
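
If stale statistics are the cause, refreshing them is a cheap test (a sketch; UPDATE STATISTICS is standard SQL Server, but whether it helps here is an assumption):

-- Rebuild statistics on t1 with a full scan, then re-run the slow query
UPDATE STATISTICS t1 WITH FULLSCAN;

Afterwards, comparing Estimated vs Actual Number of Rows on the plan operators will show whether a bad estimate was the problem.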

attila

Can't you place all your conditions in the ON clause?

SELECT t1.id
FROM t1
    JOIN
    (
          SELECT id
          FROM t1
          WHERE <condition>
    ) t2
        on t1.id = t2.id;

is transformed into

SELECT t1.id
FROM t1
     JOIN t1 as t2
        ON t1.id = t2.id AND <condition>

UPDATE:

Retrieving the last record in each group: the link shows how to do this. The SQL is as follows:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

You can use this instead of GROUP BY; an adapted version is sketched below.
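
Adapted to the table in the question (a sketch: it reuses the question's column names and date literals and assumes a higher id means a later transaction):

SELECT a.TransactionNumber
FROM t1 a
    LEFT JOIN t1 b
        ON a.AccountNumber = b.AccountNumber
       AND a.id < b.id
       AND b.Period BETWEEN '01-11-2013' and '01-12-2014'
WHERE a.Period BETWEEN '01-11-2013' and '01-12-2014'
  AND b.id IS NULL;  -- no later transaction for this account in the range

As RBarryYoung points out in the comments, though, a triangular join like this is unlikely to beat the GROUP BY.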

StanislavL
  • The table stores transactions. I'm getting the maximum transaction for each account. – user172839 Feb 21 '14 at 06:21
  • Thanks, but I'm actually more interested, for my own curiosity, in why the query has gone from 20+ seconds to 1 second just from moving the inner query to a temp table and then joining onto the temp table. – user172839 Feb 24 '14 at 06:10
  • A triangular join is not likely to be faster than a `GROUP BY`. – RBarryYoung Mar 09 '14 at 18:50
  • @user172839 Your curiosity is best satisfied if you look at the execution plans for those two queries, like has been mentioned in many other comments. If you are using MS SQL Server management studio, simply type the query and click "Display estimated execution plan" from the toolbar. – Esko Piirainen Jul 30 '14 at 08:35

The main difference between these two queries is that the second one is strictly querying off the index.

SELECT t1.TransactionNumber
FROM t1
JOIN
(
      SELECT MAX(id) id
      FROM t1
      WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
      GROUP BY AccountNumber
) t2
    on t1.id = t2.id

is querying the TransactionNumber column, and therefore cannot use the index you have on the table; the second query is only using the id. This will make all the difference in the world.
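
If covering is the issue, a covering index would let the whole query be answered from the index (a sketch; the index name and column order are assumptions, not from the post):

-- Supports the Period filter and the per-account MAX(id);
-- TransactionNumber is included so the outer SELECT is covered too,
-- and id comes along for free as the clustering key.
CREATE NONCLUSTERED INDEX IX_t1_AccountNumber_Period
    ON t1 (AccountNumber, Period)
    INCLUDE (TransactionNumber);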

Talspaugh27
  • Sorry I didn't actually write the query here correctly. The select columns for both queries should be the same. – user172839 Mar 04 '14 at 23:16

Joins consume a lot of resources, and the computed result has to be joined back to the table again, so it takes a long time. When you use a temp table, however, the result is already materialized, so the join condition runs faster than with the subquery. An indexed variant is sketched below.
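
If you go the temp-table route, indexing #t2 can help the join even further (a sketch; the index name is made up):

-- Unique because each MAX(id) is a distinct t1 key;
-- ordered seeks against t1's clustered PK become possible
CREATE UNIQUE CLUSTERED INDEX IX_t2_id ON #t2 (id);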


Try this:

SELECT t1.TransactionNumber
FROM t1
WHERE t1.id IN
    (
        SELECT MAX(id)
        FROM t1
        WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
        GROUP BY AccountNumber
    )
Mohammed Saqib Rajput

It is generally better for performance (though a bit more code) to explicitly declare the columns and data types instead of using SELECT ... INTO. This may be faster:

CREATE TABLE #t2
    (
    id INT
    );
-- Load the per-account maximum ids for the period
INSERT INTO #t2 (id)
SELECT MAX(id)
FROM t1
WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
GROUP BY AccountNumber;
 SELECT t1.id 
 FROM t1
    JOIN #t2 t2
        ON t1.id = t2.id
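
A small variant (an addition, not part of the original answer): since each MAX(id) is a distinct primary-key value of t1, #t2.id can itself be declared as a primary key, giving the optimizer uniqueness and ordering for the join:

CREATE TABLE #t2
    (
    id INT NOT NULL PRIMARY KEY  -- unique: each MAX(id) is a distinct t1 key
    );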
Phrancis

The difference is that in the 1st query, the engine doesn't know the number of results in t2 (which I assume is relatively small compared to the row count of t1, but SQL Server doesn't know that in advance). So the execution plan begins with t1 (looping over a large number of rows). In the 2nd query, however, #t2 already holds a known number of records before the second part of the query executes. So there the SQL engine starts execution with #t2 (scanning #t2, since it's small) and, for each key in #t2, performs an index seek into t1.
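
One way to test this explanation without the temp table is to pin the join order so the aggregate runs first (a sketch; OPTION (FORCE ORDER) is a real SQL Server hint, but whether it reproduces the fast plan here is an assumption):

SELECT t1.TransactionNumber
FROM
    (
        SELECT MAX(id) AS id
        FROM t1
        WHERE Period BETWEEN '01-11-2013' and '01-12-2014'
        GROUP BY AccountNumber
    ) t2
    JOIN t1
        ON t1.id = t2.id
OPTION (FORCE ORDER);  -- keep the written order: build t2 first, then seek t1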