0

Are there differences in speed between those 3 queries? They all return the same.

The data structure of this MRE is two tables: tests and id_tests.

  • tests have 3 columns (unique_id, allele, mfi_valeur). Primary key: unique_id. It's also a foreign key that references unique_id from the second table id_tests.
  • id_tests have 2 columns (unique_id, type_test). Primary key: unique_id.
SELECT i.type_test
  FROM [DATA_CQE].[dbo].[tests] as t, [DATA_CQE].[dbo].[id_tests] as i
  WHERE t.unique_id = i.unique_id
  AND t.allele = 'A*01:01'
  AND t.mfi_valeur > 10000;
SELECT i.type_test
  FROM [DATA_CQE].[dbo].[tests] as t
  JOIN [DATA_CQE].[dbo].[id_tests] as i
  ON t.unique_id = i.unique_id
  WHERE t.allele = 'A*01:01'
  AND t.mfi_valeur > 10000;
SELECT type_test 
FROM [DATA_CQE].[dbo].[id_tests]
WHERE unique_id IN (
  SELECT unique_id
  FROM [DATA_CQE].[dbo].[tests]
  WHERE allele = 'A*01:01'
  AND mfi_valeur > 10000
);

Here are the query plans: https://www.brentozar.com/pastetheplan/?id=BJWQF89r3. I'm not sure I'm fully aware how I should read it. The queries are insanely fast because I truncated my database for the tests (instead of several tens of millions of rows I only left tens of thousands).

What are the rules to determine what is the best way to do such queries and how would one reliably test differences between queries?

  • 1
    The first and second are identical besides the syntax difference. Your first is just an older style of join before the keyword `JOIN` was added to the sql standard a million years ago. I would avoid it since it's no longer the 1980s. – JNevill May 23 '23 at 15:29

1 Answers1

2

The first two queries are absolutely identical. An inner join is always transformed by the compiler into a cross-join and filter, before further transformations are done, usually ending up in one of the different join implementations. So it makes no difference how you write it.

Having said that, comma-join syntax, or even CROSS JOIN, is very difficult to understand, and it's much better to write it using explicit JOIN syntax.


So comparing the second and third queries:

The third query is a semi-join, it does not require any results from the tests table. But since both tables have a primary/unique key on unique_id, and you are not selecting tests even in the second query, there will be no difference in the results.

For this reason, it is unlikely that the compiler will choose a different plan using one syntax or the other. As you can see from your query plans, they are all the same.

If, however, you did not have that primary/unique key then the results could be significantly different. So the compiler may choose a different plan.


TL;DR; Write your queries in the most obvious fashion first, then worry about performance afterwards.


I also note that you can improve your existing plan by adding the following index. This will help in the case when only a small portion of the tests table is actually being used, where your existing plan requires reading the whole table.

CREATE INDEX IX ON tests (allele, mfi_valeur);
Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • I like the highlight of the semi-join since we only use `tests` for selection of the `ids`. May I ask why we wouldn't expect a better performance since we're only "semi-joining"? It feels like it would require "less work"? – FluidMechanics Potential Flows May 24 '23 at 14:43
  • Because the join condition `unique_id = unique_id` is unique on both sides, so the work is excatly the same. If the inner side `tests` was not unique then you'd be right it would often be more work to do a regular join then aggregate rather than a semi-join (although the compiler may choose to do that anyway if it thinks that's going to be faster) – Charlieface May 24 '23 at 14:49
  • Oh, I see. Is it exactly the same because in both cases, the WHERE is applied "first"? seems like it does for the join: https://stackoverflow.com/questions/5463101/where-clause-better-execute-before-in-and-join-or-after and obviously does for the nested one – FluidMechanics Potential Flows May 24 '23 at 14:50
  • 1
    The optimizer is perfectly capable of transforming between all these three options when necessary. It will work out what filters to apply first etc, and can optimize a semi-join as well as an inner-join. Not always is a semi-join faster: if for every given outer value there are only one or two inner rows then it's often faster to just do a normal hash or merge join rather than trying to seek on each outer value. No, what I'm saying is that in this case `unique_id` is unique on the inner side, so you are always guaranteed to get a max of one row on the inner side. This means it's the exact .... – Charlieface May 24 '23 at 14:56
  • 1
    ... same work as doing an inner join. In the case where it's not unique, there may be multiple rows to match on the inner side, so it is often (but not always) worth doing it as a semi-join. If in that case you were to rewrite it as a `INNER JOIN` plus `GROUP BY` you may find it doing a semi-join anyway, and v.v. – Charlieface May 24 '23 at 14:56