I have a large SQL table with 100+ million records in SQL Server as follows:
+-----------+----------+----------+----------+----------+
|CustomerID |TransDate |Category | Num_Trans |Sum_Trans |
+-----------+----------+----------+----------+----------+
|457136432 |2022-12-31|TAXI |18 |220.34 |
|863326783 |2022-12-31|FOOD |76 |980.71 |
+-----------+----------+----------+----------+----------+
this table contains number of transactions and sum of transaction values for each category in a 6-month window prior to "TransDate" per Customer.
I need to calculate some percentile calculations for a reporting task (query below)
SELECT DISTINCT TransDate, Category,
, PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
AS P_LOWER_NUM
, PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
AS P_MEDIAN_NUM
, PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
AS P_UPPER_NUM
, PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
AS P_LOWER_SUM
, PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
AS P_MEDIAN_SUM
, PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
AS P_UPPER_SUM
FROM large_tbale
with the following non-clustered index on the table
CREATE NONCLUSTERED INDEX MyIndex
ON large_table (TransDate, Category)
INCLUDE (Num_Trans, Sum_Trans);
unfortunately, the table creation takes over 30 minutes which is not acceptable for my application.
to boost the performance, I even partitioned the table based on "TransDate" and created index on the partitions. but the execution time does not change significantly by table partitioning either.
Is there anyway to speed up these calculations?
PS- looking at the execution plan, it seems the problem is the fact that for each percentile_cont calculation, the query needs to sort the records. Is it possible to somehow sort "Num_Trans" and "Sum_trans" columns once and then do percentile calculations?