I have a performance problem with a stored procedure. When I checked my benchmark results, I realized that "MatchxxxReferencesByIds" has an average LastElapsedTimeInSecond of 240.25 ms. How can I improve my procedure?

ALTER PROCEDURE [Common].[MatchxxxReferencesByIds]
    (@refxxxIds VARCHAR(MAX),
     @refxxxType NVARCHAR(250))
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRAN

    DECLARE @fake_tbl TABLE (xxxid NVARCHAR(50))

    INSERT INTO @fake_tbl  
        SELECT LTRIM(RTRIM(split.a.value('.', 'NVARCHAR(MAX)'))) AS fqdn   
        FROM 
            (SELECT 
                 CAST ('<M>' + REPLACE(@refxxxIds, ',', '</M><M>') + '</M>' AS XML) AS data
            ) AS a   
        CROSS APPLY 
            data.nodes ('/M') AS split(a)

    SELECT [p].[ReferencedxxxId]  
    FROM [Common].[xxxReference] AS [p] 
    WHERE ([p].[IsDeleted] = 0) 
      AND (([p].[ReferencedxxxType] COLLATE Turkish_CI_AS  = @refxxxType COLLATE Turkish_CI_AS ) 
      AND [p].[ReferencedxxxId] COLLATE Turkish_CI_AS  IN (SELECT ft.xxxid COLLATE Turkish_CI_AS FROM @fake_tbl ft))

    COMMIT;
END;
  • Where are you calling this stored procedure from? If possible, prefer using table valued parameters instead of passing a string and split it inside the procedure. – Zohar Peled Jun 19 '19 at 06:59
  • @Penguen please show us the sample input data too. – Vahid Farahmandian Jun 19 '19 at 07:11
  • @VahidFarahmandian For execution: exec [Common].[MatchEntityReferencesByIds] '423423,423423,423432,23423', 'TEST' – Penguen Jun 19 '19 at 07:43
  • Hard-coding a collation will *prevent* the server from using any indexes on that column if they have a different collation. Why is `COLLATE Turkish_CI_AS` used? – Panagiotis Kanavos Jun 19 '19 at 07:51
  • No query change is going to remove the full table scan caused by different collations. What is the table's schema (CREATE statement)? Perhaps you should change that column's collation, or create a different column with the same data but different collation. [It's not possible to create an index with a different collation from the column's](https://stackoverflow.com/questions/6199440/collations-on-indexes-in-sql-server) – Panagiotis Kanavos Jun 19 '19 at 07:57
  • On the other hand, `ReferencedEntityId` looks like something that should be an `int` or `bigint`, not a text type with a collation. If you changed the table variable's column to an `int` you could get rid of collations and possible missing indexes – Panagiotis Kanavos Jun 19 '19 at 07:57
  • Please post the table schema, index information and the table size (number of rows, data size). Without these, people have to guess - is `ReferencedEntityId` an `int` or not? What is the type of `ReferencedEntityType`? and why is the collation change needed? Fixing the table variable won't help if the table has the wrong types or columns – Panagiotis Kanavos Jun 19 '19 at 08:24
  • @Penguen please *stop* changing peoples' answers. Just because you decided to change the column names in the question doesn't mean the answers are wrong or that the changes are meaningful. – Panagiotis Kanavos Jun 20 '19 at 10:04
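Following up on the first comment's suggestion, here is a minimal sketch of the table-valued-parameter approach (the type name dbo.xxxIdList is hypothetical; the collations and column types mirror the question's code, and the rest of the procedure stays the same):

CREATE TYPE dbo.xxxIdList AS TABLE (xxxid NVARCHAR(50));
GO

ALTER PROCEDURE [Common].[MatchxxxReferencesByIds]
    (@refxxxIds dbo.xxxIdList READONLY,  -- ids arrive already split, no XML parsing needed
     @refxxxType NVARCHAR(250))
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [p].[ReferencedxxxId]
    FROM [Common].[xxxReference] AS [p]
    WHERE [p].[IsDeleted] = 0
      AND [p].[ReferencedxxxType] COLLATE Turkish_CI_AS = @refxxxType COLLATE Turkish_CI_AS
      AND [p].[ReferencedxxxId] COLLATE Turkish_CI_AS IN
          (SELECT ft.xxxid COLLATE Turkish_CI_AS FROM @refxxxIds ft);
END;

The caller then fills the table type and passes it instead of a delimited string:

DECLARE @ids dbo.xxxIdList;
INSERT INTO @ids VALUES ('423423'), ('23423');
EXEC [Common].[MatchxxxReferencesByIds] @ids, 'TEST';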

4 Answers


One can only make assumptions without knowing the table's schema, indexes and data sizes.

Hard-coding collations can prevent the query optimizer from using any indexes on the ReferencedEntityId column. The field name and sample data '423423,423423,423432,23423' suggest this is a numeric column anyway (int? bigint?). The collation shouldn't be needed and the variable's column type should match the table's type.

Finally, a.value can return an int or bigint directly, which means the splitting query can be rewritten as:

declare @refEntityIds nvarchar(max)='423423,423423,423432,23423';

DECLARE @fake_tbl TABLE (entityid bigint, INDEX IX_TBL(entityid))

INSERT INTO @fake_tbl  
SELECT split.a.value('.', 'bigint') AS fqdn   
FROM 
    (SELECT 
            CAST ('<M>' + REPLACE(@refEntityIds, ',', '</M><M>') + '</M>' AS XML) AS data
    ) AS a   
CROSS APPLY 
    data.nodes ('/M') AS split(a)

The sample input contains duplicate values, so entityid can't be declared as a PRIMARY KEY; a plain index is enough.

After that, the query can change to:

SELECT [p].[ReferencedEntityId]  
FROM [Common].[EntityReference] AS [p] 
WHERE [p].[IsDeleted] = 0
  AND [p].[ReferencedEntityType] COLLATE Turkish_CI_AS  = @refEntityType COLLATE Turkish_CI_AS 
  AND [p].[ReferencedEntityId]  IN (SELECT ft.entityid FROM @fake_tbl ft)

The next problem is the hard-coded collation. Unless it matches the column's actual collation, this prevents the server from using any indexes that cover that column. How to fix this depends on the actual data statistics. Perhaps the column's collation has to change or perhaps the rows after filtering by ReferencedEntityId are so few that there's no benefit to this.

Finally, IsDeleted can't be usefully indexed on its own. It's either a bit column whose values are 1/0 or another numeric column that still contains only 0/1. An index with such low selectivity won't be used by the query optimizer, because it's actually faster to just scan the rows returned by the other conditions.

A general rule is to put the most selective index column first. The database combines all columns to create one "key" value and construct a B+-tree index from it. The more selective the key, the fewer index nodes need to be scanned.

IsDeleted can still be used in a filtered index to index only the non-deleted rows. This allows the query optimizer to eliminate unwanted rows from the search. The resulting index will be smaller too, which means the same number of IO operations will load more index pages into memory and allow faster seeking.

All of this means that EntityReference should have an index like this one:

CREATE NONCLUSTERED INDEX IX_EntityReference_ReferenceEntityID  
    ON Common.EntityReference (ReferencedEntityId, ReferencedEntityType)  
    WHERE IsDeleted = 0; 

If the collations don't match, ReferencedEntityType won't be used for seeking. If this is the most common case, we can remove ReferencedEntityType from the index key and put it in an INCLUDE clause. The field won't be part of the index key, although it will still be available for filtering without having to load data from the actual table:

CREATE NONCLUSTERED INDEX IX_EntityReference_ReferenceEntityID  
    ON Common.EntityReference (ReferencedEntityId)  
    INCLUDE(ReferencedEntityType)
    WHERE IsDeleted = 0; 

Of course, if that's the most common case, the column's collation should be changed instead.

  1. I don't think you need a TRAN. You are simply "shredding" your comma-separated values into a @variable table and doing a SELECT. A TRAN isn't needed here.

  2. Try an EXISTS:

SELECT [p].[ReferencedEntityId]  
    FROM [Common].[EntityReference] AS [p] 
    WHERE ([p].[IsDeleted] = 0) 
      AND (([p].[ReferencedEntityType] COLLATE Turkish_CI_AS = @refEntityType COLLATE Turkish_CI_AS) 
      AND EXISTS (SELECT 1 FROM @fake_tbl ft WHERE ft.entityid COLLATE Turkish_CI_AS = [p].[ReferencedEntityId] COLLATE Turkish_CI_AS))

  3. See https://www.sqlshack.com/efficient-creation-parsing-delimited-strings/ for different ways to parse your delimited string.

quote from article:

Microsoft’s built-in function provides a solution that is convenient and appears to perform well. It isn’t faster than XML, but it clearly was written in a way that provides an easy-to-optimize execution plan. Logical reads are higher, as well. While we cannot look under the covers and see exactly how Microsoft implemented this function, we at least have the convenience of a function to split strings that are shipped with SQL Server. Note that the separator passed into this function must be of size 1. In other words, you cannot use STRING_SPLIT with a multi-character delimiter, such as ‘”,”’.
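For illustration, the built-in function the article refers to is STRING_SPLIT (available from SQL Server 2016 onward); note the single-character delimiter restriction the quote mentions:

DECLARE @ids VARCHAR(MAX) = '423423,423423,423432,23423';

SELECT value AS entityid
FROM STRING_SPLIT(@ids, ',');   -- the separator must be exactly one character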

  4. Post a screenshot of your execution plan. If you don't have a proper index (or you have "hints" that prevent use of indexes), your query will never perform well.
  • The `COLLATE` clause can easily prevent the server from using any indexes and force a full table scan. `EXISTS` doesn't affect that. In any case the server probably generates the same execution plan in both cases, as the two statements (IN vs EXISTS) are equivalent – Panagiotis Kanavos Jun 19 '19 at 07:52
  • So I haven't done much of this kind of tweaking since 2005/2008 sql server, but a ~few~ times I got better performance by flip-flopping IN and EXISTS. The other flip flop I would try would be @variableTable vs #tempTable. And of course parameter sniffing (not germane to this question) was my other "try it to make sure". – granadaCoder Jun 19 '19 at 07:58
  • Great comment about "It's not possible to create an index with a different collation from the column's" (in the original questions comments by @Pan – granadaCoder Jun 19 '19 at 07:59
2

Based on the execution plan of the stored procedure, what makes it perform slowly is the part where it works with XML.

Let's rethink the solution:

I have created a table like this:

CREATE TABLE [Common].[EntityReference]
(
    IsDeleted BIT,
    ReferencedEntityType VARCHAR(100),
    ReferencedEntityId VARCHAR(10)
);
GO

and populated it like this (insert 1M records into it):

DECLARE @i INT = 1000000;
DECLARE @isDeleted BIT,
        @ReferencedEntityType VARCHAR(100),
        @ReferencedEntityId VARCHAR(10);
WHILE @i > 0
BEGIN
    SET @isDeleted =(SELECT @i % 2);
    SET @ReferencedEntityType = 'TEST' + CASE WHEN @i % 2 = 0 THEN '' ELSE CAST(@i % 2 AS VARCHAR(100)) END;
    SET @ReferencedEntityId = CAST(@i AS VARCHAR(10));
    INSERT INTO [Common].[EntityReference]
    (
        IsDeleted,
        ReferencedEntityType,
        ReferencedEntityId
    )
    VALUES (@isDeleted, @ReferencedEntityType, @ReferencedEntityId);

    SET @i = @i - 1;
END;

Let's analyse your code:

You have a comma-delimited input (@refEntityIds) that you want to split and then run a query against those values (your SP's subtree cost on my PC is about 376). To do so you have different approaches:

1. Pass a table variable to the stored procedure which contains the refEntityIds

2. Make use of the STRING_SPLIT function to split the string. Let's see the sample query:

INSERT INTO @fake_tbl
      SELECT value
        FROM STRING_SPLIT(@refEntityIds, ',');

Using this, you will gain a great performance improvement in your code (subtree cost: 6.19 without the following indexes). BUT this feature is not available in SQL Server 2008!

You can use a replacement for this function (read this: https://stackoverflow.com/a/54926996/1666800) and change your query to this (subtree cost is still about 6.19):

INSERT INTO @fake_tbl
    SELECT value FROM dbo.[fn_split_string_to_column](@refEntityIds,',')
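For reference, a minimal sketch of what such a fn_split_string_to_column function could look like on SQL Server 2008, reusing the same XML trick as the original procedure (this is an assumed implementation, not necessarily the one in the linked answer, and it breaks if the input contains XML-special characters such as & or <):

CREATE FUNCTION dbo.fn_split_string_to_column
(
    @string NVARCHAR(MAX),
    @delimiter CHAR(1)
)
RETURNS @result TABLE (value NVARCHAR(MAX))
AS
BEGIN
    -- Wrap each delimited token in <M> elements, then shred the XML
    DECLARE @xml XML =
        CAST('<M>' + REPLACE(@string, @delimiter, '</M><M>') + '</M>' AS XML);

    INSERT INTO @result (value)
    SELECT LTRIM(RTRIM(split.a.value('.', 'NVARCHAR(MAX)')))
    FROM @xml.nodes('/M') AS split(a);

    RETURN;
END;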

In this case again you will see a notable performance improvement.

You can also create a non-clustered index on the [Common].[EntityReference] table, which gives a small performance improvement too. But please think before creating an index; it might have a negative impact on your DML operations:

CREATE NONCLUSTERED INDEX [Index Name] ON [Common].[EntityReference]
(
    [IsDeleted] ASC
)
INCLUDE ([ReferencedEntityType],[ReferencedEntityId]) 

Without this index (suppose I have replaced your split solution with mine), the subtree cost is 6.19. When I add the aforementioned index, the subtree cost decreases to 4.70, and finally when I change the index to the following one, the subtree cost is 5.16:

CREATE NONCLUSTERED INDEX [Index Name] ON [Common].[EntityReference]
(
    [ReferencedEntityType] ASC,
    [ReferencedEntityId] ASC
)
INCLUDE ([IsDeleted]) 

Thanks to @PanagiotisKanavos, the following index performs even better than the aforementioned ones (subtree cost: 3.95):

CREATE NONCLUSTERED INDEX IX_EntityReference_ReferenceEntityID  
    ON Common.EntityReference (ReferencedEntityId)  
    INCLUDE(ReferencedEntityType)
    WHERE IsDeleted =0; 

Also please note that using a transaction around a local table variable has almost no effect, and you can probably simply omit it.

  • That index probably won't improve anything because of its low cardinality. It won't be used by the optimizer because it simply splits the records in half. If the type is `bit`, it's not possible to create the index in the first place – Panagiotis Kanavos Jun 19 '19 at 08:04
  • A *far* better index would be (ReferencedEntityId,ReferencedEntityType) INCLUDE(IsDeleted). `IsDeleted` could be used as an index filter too – Panagiotis Kanavos Jun 19 '19 at 08:06
  • @PanagiotisKanavos thank for the comment. I have updated the answer and provides the subtree cost for different indexing situations – Vahid Farahmandian Jun 19 '19 at 08:14
  • You can't provide that without actual data, and we have no idea what the OP's table looks like. If the subtree costs are so close it probably means there's so little data that the indexes either aren't used, or just scanned – Panagiotis Kanavos Jun 19 '19 at 08:16
  • As for the replacement for splitting, the loop technique is the slowest one, while SQLCLR and XML are the fastest. All splitting techniques come from [Aaron Bertrand's articles and benchmarks](https://sqlperformance.com/2012/07/t-sql-queries/split-strings), one way or another – Panagiotis Kanavos Jun 19 '19 at 08:17
  • @PanagiotisKanavos you are right about the index, that is why I bold the "a little" – Vahid Farahmandian Jun 19 '19 at 08:18
  • @PanagiotisKanavos "splitting.. it is slower", slower than which? – Vahid Farahmandian Jun 19 '19 at 08:19
  • In fact, the loop technique scales so badly [it was eliminated from the benchmarks](https://sqlblog.org/2010/07/07/splitting-a-list-of-integers-another-roundup) – Panagiotis Kanavos Jun 19 '19 at 08:19
  • @PanagiotisKanavos I have updated the answer and manipulate the table with 1M records, but still same result! – Vahid Farahmandian Jun 19 '19 at 08:33

If [p].[ReferencedEntityId] is going to contain integers, then you don't need to apply a COLLATE clause. You can apply the IN condition directly.

  1. You can go for simple comma separated values to integers list using Table valued functions. There are many samples. Keep the datatype of the ID as integer, to avoid collation application.
[p].[ReferencedEntityId] IN (SELECT ft.entityid FROM @fake_tbl ft)
  • #1 is what the OP already did. In fact, the OP used a far faster method than any of those described in that link. – Panagiotis Kanavos Jun 19 '19 at 08:00
  • @PanagiotisKanavos, OP is using IN condition as below: `[p].[ReferencedEntityId] COLLATE Turkish_CI_AS IN (SELECT ft.entityid COLLATE Turkish_CI_AS FROM @fake_tbl ft))` – Venkataraman R Jun 19 '19 at 08:02
  • I'm referring to `You can go for simple comma separated values to integers list`. The OP is already doing that, using a faster method. All the string splitting techniques can be traced to Aaron Bertrand's articles on string splitting [like this one](https://sqlperformance.com/2012/07/t-sql-queries/split-strings). The XML technique is the fastest after the SQLCLR technique – Panagiotis Kanavos Jun 19 '19 at 08:04
  • @PanagiotisKanavos, oh ok. Got it. Thanks for clarifying. – Venkataraman R Jun 19 '19 at 08:53