SQL INNER JOIN vs WHERE IN big performance difference

Question

I have read multiple sources and still don't understand where a big difference comes from in a query I have for Microsoft SQL Server.

I need to count different alerts linked to vehicles (IdMateriel is synonymous to the id of the vehicle) based on types (CodeAlerte), state (Etat), and a Top true/false column, but beofre the different counts I need to select the data.

TLDR : There are two parameters, the current date as an SQL DATETIME, and the VARCHAR(MAX) string of entity codes separated by commas, which I split using STRING_SPLIT to use them either in WHERE IN clause or INNER JOIN. Using it in the first clause is ~10x faster than the second clause although it seems equivalent to me. Why?

First, the queries are based on the view created as follows:

CREATE OR ALTER VIEW [dbo].[AlertesVehicules]  
WITH SCHEMABINDING  
AS  
    SELECT dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte, 
           dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat, 
           Vehicule.Top, Vehicule.EtatVehicule,COUNT_BIG(*) AS COUNT
    FROM  dbo.Alerte 
    INNER JOIN dbo.Materiel on dbo.Alerte.IdMateriel= dbo.Materiel.Id
    INNER JOIN dbo.Vehicule on dbo.Vehicule.Id= dbo.Materiel.Id
    INNER JOIN dbo.TypeAlerte on dbo.Alerte.IdTypeAlerte = dbo.TypeAlerte.Id
    WHERE dbo.Materiel.EntiteGestion is NOT NULL
        AND dbo.TypeAlerte.CodeAlerte IN ('P07','P08','P09','P11','P12','P13','P14')
    GROUP BY dbo.Alerte.IdMateriel, dbo.Materiel.EntiteGestion, dbo.Alerte.IdTypeAlerte, 
             dbo.TypeAlerte.CodeAlerte, dbo.TypeAlerte.TopAlerteMajeure, dbo.Alerte.Etat, 
             Vehicule.Top, Vehicule.EtatVehicule
GO  

CREATE UNIQUE CLUSTERED INDEX IX_AlerteVehicule   
    ON dbo.AlertesVehicules (EntiteGestion,IdMateriel,CodeAlerte,Etat,TopAlerteMajeure);  
GO

This first version of the query takes ~100ms:

SELECT DISTINCT a.IdMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTabAlertes
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule
INNER JOIN (SELECT value AS entiteGestion FROM STRING_SPLIT(@entiteGestion, ',')) eg
    ON a.EntiteGestion = eg.entiteGestion                                               
WHERE
    a.CodeAlerte IN ('P08','P09')
    AND @currentDate <= ISNULL(ase.DateFin, @currentDate)

According to SQL Sentry Plan Explorer, the actual execution plan starts with an index seek taking ~30% of the time on dbo.Alerte with the predicate Alerte.IdTypeAlerte=TypeAlerte.Id, outputting 369 000 rows of Etat, IdMateriel and IdTypeAlerte, which it then directly filters down to 7 742 based on predicate PROBE(Opt_Bitmapxxxx,Alerte.IdMateriel), and then inner joins taking ~25% of the time with the 2 results of another index seek of TypeAlerte but with predicates TypeAlerte.CodeAlerte = N'P08' and =N'P09'. So just these two parts take > 50ms, but I don't understand why there are so many initial results.

The second version takes ~10ms :

SELECT DISTINCT a.dMateriel, a.CodeAlerte, a.Etat, a.Top INTO #tmpTab
FROM dbo.AlertesVehicules a
LEFT JOIN tb_AFFECTATION_SECTION ase ON a.IdMateriel = ase.ID_Vehicule 
 WHERE
    a.EntiteGestion IN (SELECT value FROM STRING_SPLIT(@entiteGestion, ','))
    AND a.CodeAlerte IN ('P08','P09')
    AND (@currentDate <= ISNULL(ase.DateFin, @currentDate))

For this one, SQL Sentry Plan Explorer starts with a View Clustered Index Seek directly on the view AlertesVehicules with Seek Predicates AlertesVehicule.EntiteGestion > Exprxxx1 and <Exprxxx2and Predicate AlertesVehicules.CodeAlerte=N'P08' and =N'P09'

Why are those two treated so differently when it seems to me that they are exactly equivalent?

For references, here are some threads I already looked into, but didn't seem to find an explanation in (except that there shouldn't be a difference):

SQL JOIN vs IN performance?

Performance gap between WHERE IN (1,2,3,4) vs IN (select * from STRING_SPLIT('1,2,3,4',','))

[Brent Ozar recently wrote](https://www.brentozar.com/archive/2022/12/should-you-use-sql-server-2022s-string_split/) about using `STRING_SPLIT` in the `WHERE`. TL;DR you may find better performance `INSERT`ing the data (from `STRING_SPLIT`) into an (indexed) temporary table and referencing that. — Thom A, Jan 19 '23 at 12:46
While logically these are two similar operations, they're not identical. The behavior of IN as a filtering criteria isn't the same as the behavior of a JOIN (which, you don't have an ON clause for, so that's going to be a cartesian product, VERY different from IN), so the optimizer makes different choices. That's all. — Grant Fritchey, Jan 19 '23 at 13:15
Also, take a look at Jeff Moden's tally table or table of numbers testing. That frequently outperforms any other option for doing a string split like this. — Grant Fritchey, Jan 19 '23 at 13:16
@GrantFritchey about not having an ON clause, you're right I forgot to include it here but it was in my original problem. I fixed the post now to better reflect the problem. — Théophane, Jan 19 '23 at 14:33
Thanks both for the tips, I'll look into it more. I was wondering wether there is a specific thing I'm not understanding here, or if it "just comes down to the optimizer". I spent some time trying to rewrite the query and I wanted to understand why to be faster the next time I have to rewrite a query — Théophane, Jan 19 '23 at 14:35

SQL INNER JOIN vs WHERE IN big performance difference

0 Answers0