We have a BI service that is running a set of queries on 2 tables of respectively 60k and 140k rows. The database is a S2-class (50 DTU) single-database Azure SQL database.
A sample query looks like this:
select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
from [bi].[Trigger] as [t1]
left join (
select [t2].[DefinitionId], count(1) as [Count]
from [bi].[Trigger] as [t2]
inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0 and [t2].[DeliveryStatus] = @deliveryStatus
where [t2].[CampaignId] = @campaignId
group by [t2].[DefinitionId], [t2].[Type]
) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
where [t1].[CampaignId] = @campaignId
Proper indexes are positioned on both [Trigger] and [Recipient] tables:
CREATE INDEX [IX_Trigger_CampaignId] on [bi].[Trigger] ([CampaignId]) INCLUDE ([DefinitionId],[Type]);
CREATE INDEX [IX_Trigger_CampaignIdDeliveryStatus] on [bi].[Trigger] ([CampaignId],[DeliveryStatus]) INCLUDE ([DefinitionId],[Type],[RecipientId]);
CREATE INDEX [IX_Recipient_CampaignId] ON [bi].[Recipient] ([CampaignId]);
CREATE INDEX [IX_Recipient_RecipientIdActive] ON [bi].[Recipient] ([RecipientId],[IsDeactivated],[IsDeleted]);
CREATE INDEX [IX_Recipient_Active] ON [bi].[Recipient] ([IsDeactivated],[IsDeleted]) INCLUDE ([RecipientId]);
Queries are executed using Entity Framework Core using something like this:
return await _dbContext
.TriggerCounts
.FromSql(sqlTriggerCountByDeliveryStatus, new SqlParameter("@campaignId", campaignId), new SqlParameter("@deliveryStatus", deliveryStatus))
.ToArrayAsync(ct);
Response times for this query vary from 10ms to sometimes over 30s where it simply times out. When running this query from an Azure Data Studio shell, it returns consistently under 200ms.
Obtaining the explain plan for this query does not include any full table/index scan and does not recommend configuring additional indexes.
As far as we know, these queries are being run sequentially as part of a single-process pipeline where commands are dequeued one at a time.
What could explain these vastly different results?
Edit 1
A piece of the puzzle emerged. Running these 2 variants of the above query produce vastly different response times when executed in the Data Studio Shell:
1.
declare @campaignId nvarchar(128) = '3155373858485711912C4FC8383061F44488933B942FBF81BB1';
declare @deliveryStatus nvarchar(32) = 'delivered';
select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
from [bi].[Trigger] as [t1]
left join (
select [t2].[DefinitionId], count(1) as [Count]
from [bi].[Trigger] as [t2]
inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0 and [t2].[DeliveryStatus] = @deliveryStatus
where [t2].[CampaignId] = @campaignId
group by [t2].[DefinitionId], [t2].[Type]
) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
where [t1].[CampaignId] = @campaignId
2.
declare @campaignId nvarchar(128) = '3155373858485711912C4FC8383061F44488933B942FBF81BB1';
declare @deliveryStatus nvarchar(32) = 'delivered';
select distinct [t1].[DefinitionId], isnull([t3].[Count], 0) as Count, [t1].[Type]
from [bi].[Trigger] as [t1]
left join (
select [t2].[DefinitionId], count(1) as [Count]
from [bi].[Trigger] as [t2]
inner join [bi].[Recipient] as [r] on [t2].[RecipientId] = [r].[RecipientId] and [r].[IsDeleted] = 0 and [r].[IsDeactivated] = 0 and [t2].[DeliveryStatus] = 'delivered'
where [t2].[CampaignId] = '3155373858485711912C4FC8383061F44488933B942FBF81BB1'
group by [t2].[DefinitionId], [t2].[Type]
) as [t3] on [t1].[DefinitionId] = [t3].[DefinitionId]
where [t1].[CampaignId] = '3155373858485711912C4FC8383061F44488933B942FBF81BB1'
Variant 1 consistently returns under 200ms, while variant 2 consistently returns around 10s.
Response times for variant 2 are more in line with what we're getting in Azure. Are we running the query properly with EF Core?
Edit 2
Here are the query plans:
Another interesting element is that query1 has timed out at 11:30:05 but then ran in <300ms at 11:30:38, merely 30 seconds later, with the same parameters...