Azure SQL Database - wildly different response times for same query

Question

We have a BI service that is running a set of queries on 2 tables of respectively 60k and 140k rows. The database is a S2-class (50 DTU) single-database Azure SQL database.

A sample query looks like this:

select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
            from [bi].[Trigger] as [t1]
            left join (
                select [t2].[DefinitionId], count(1) as [Count]
                from [bi].[Trigger] as [t2]
                inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0  and [t2].[DeliveryStatus] = @deliveryStatus
                where [t2].[CampaignId] = @campaignId
                group by [t2].[DefinitionId], [t2].[Type]
            ) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
            where [t1].[CampaignId] = @campaignId

Proper indexes are positioned on both [Trigger] and [Recipient] tables:

CREATE INDEX [IX_Trigger_CampaignId] on [bi].[Trigger] ([CampaignId]) INCLUDE ([DefinitionId],[Type]);
CREATE INDEX [IX_Trigger_CampaignIdDeliveryStatus] on [bi].[Trigger] ([CampaignId],[DeliveryStatus]) INCLUDE ([DefinitionId],[Type],[RecipientId]);
CREATE INDEX [IX_Recipient_CampaignId] ON [bi].[Recipient] ([CampaignId]);
CREATE INDEX [IX_Recipient_RecipientIdActive] ON [bi].[Recipient] ([RecipientId],[IsDeactivated],[IsDeleted]);
CREATE INDEX [IX_Recipient_Active] ON [bi].[Recipient] ([IsDeactivated],[IsDeleted]) INCLUDE ([RecipientId]);

Queries are executed using Entity Framework Core using something like this:

        return await _dbContext
            .TriggerCounts
            .FromSql(sqlTriggerCountByDeliveryStatus, new SqlParameter("@campaignId", campaignId), new SqlParameter("@deliveryStatus", deliveryStatus))
            .ToArrayAsync(ct);

Response times for this query vary from 10ms to sometimes over 30s where it simply times out. When running this query from an Azure Data Studio shell, it returns consistently under 200ms.

Obtaining the explain plan for this query does not include any full table/index scan and does not recommend configuring additional indexes.

As far as we know, these queries are being run sequentially as part of a single-process pipeline where commands are dequeued one at a time.

What could explain these vastly different results?

Edit 1

A piece of the puzzle emerged. Running these 2 variants of the above query produce vastly different response times when executed in the Data Studio Shell:

1.

declare @campaignId nvarchar(128) = '3155373858485711912C4FC8383061F44488933B942FBF81BB1';
declare @deliveryStatus nvarchar(32) = 'delivered';

select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
            from [bi].[Trigger] as [t1]
            left join (
                select [t2].[DefinitionId], count(1) as [Count]
                from [bi].[Trigger] as [t2]
                inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0  and [t2].[DeliveryStatus] = @deliveryStatus
                where [t2].[CampaignId] = @campaignId
                group by [t2].[DefinitionId], [t2].[Type]
            ) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
            where [t1].[CampaignId] = @campaignId

2.

declare @campaignId nvarchar(128) = '3155373858485711912C4FC8383061F44488933B942FBF81BB1';
declare @deliveryStatus nvarchar(32) = 'delivered';

select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
            from [bi].[Trigger] as [t1]
            left join (
                select [t2].[DefinitionId], count(1) as [Count]
                from [bi].[Trigger] as [t2]
                inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0  and [t2].[DeliveryStatus] = 'delivered'
                where [t2].[CampaignId] = '3155373858485711912C4FC8383061F44488933B942FBF81BB1'
                group by [t2].[DefinitionId], [t2].[Type]
            ) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
            where [t1].[CampaignId] = '3155373858485711912C4FC8383061F44488933B942FBF81BB1'

Variant 1 consistently returns under 200ms, while variant 2 consistently returns around 10s.

Response times for variant 2 are more in line with what we're getting in Azure. Are we running the query properly with EF Core?

Edit 2

Here are the query plans:

Another interesting element is that query1 has timed out at 11:30:05 but then ran in <300ms at 11:30:38, merely 30 seconds later, with the same parameters...

Have you eliminated other sources of delay, in the BI service itself or elsewhere in the scope of what you are measuring? Fundamentally if you can run the query consistently fast using one scenario, but not with another, it implies the delay is not in the database which is common to both scenarios - so step 1 would be to confirm exactly where the delay occurs. — James Gaunt, Apr 26 '19 at 22:10
I believe so. If you look at Edit 1, I have additional data that seems to indicate that we're not parameterizing the query properly. Any ideas? — Spiff, Apr 26 '19 at 23:55
So the only difference is one query uses the parameters and the other specifies the values explicitly? If so that is very strange! Hard to believe something else isn't going on here. I would try to simplify both queries to see if the discrepancy continues, and if not the point at which they start to perform the same may help indicate what is going on. — James Gaunt, Apr 27 '19 at 00:06
How should we deal with parameter sniffing? And given the way we're executing the query, are we prone to be affected by this? — Spiff, Apr 27 '19 at 01:12
Do these two cases have different query plans? That's parameter sniffing - the query gets compiled for one set of parameters then the same plan is used for different parameters when it shouldn't. This is not simple to fix, but there's a wealth of knowledge if you google something like EF parameter sniffing. Here's some thins you can try. https://stackoverflow.com/questions/25145667/ef-6-parameter-sniffing — Nick.Mc, Apr 27 '19 at 10:55
Thanks for the pointers. It does look like the problem is due to a bad plan cached for the EF query. After updating statistics for the 2 tables, the query consistently returns under 200ms for both EF and manual execution. — Spiff, Apr 27 '19 at 18:48
I have tried appending OPTIONS(RECOMPILE) and OPTIONS(OPTIMIZE FOR UNKNOWN) but it didn't seem to have any effect. Maybe it did recompile the plan but it was based on outdated statistics so the query still timed out? — Spiff, Apr 27 '19 at 18:52
Please post query plans. You likely have a parameter sniffing issue impacting the plan choice. — Conor Cunningham MSFT, Apr 28 '19 at 03:20

Azure SQL Database - wildly different response times for same query

0 Answers0