0

I have a large SQL table with 100+ million records in SQL Server as follows:

+-----------+----------+----------+----------+----------+
|CustomerID |TransDate |Category | Num_Trans |Sum_Trans |
+-----------+----------+----------+----------+----------+
|457136432  |2022-12-31|TAXI     |18         |220.34    |
|863326783  |2022-12-31|FOOD     |76         |980.71    |
+-----------+----------+----------+----------+----------+

this table contains number of transactions and sum of transaction values for each category in a 6-month window prior to "TransDate" per Customer.

I need to calculate some percentile calculations for a reporting task (query below)

SELECT DISTINCT   TransDate, Category, 

                , PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
                AS P_LOWER_NUM
                , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
              AS P_MEDIAN_NUM
                , PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
              AS P_UPPER_NUM


                , PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
                AS P_LOWER_SUM
                , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
              AS P_MEDIAN_SUM
                , PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
              AS P_UPPER_SUM

FROM large_tbale

with the following non-clustered index on the table

CREATE NONCLUSTERED INDEX MyIndex 
ON large_table (TransDate, Category) 
INCLUDE (Num_Trans, Sum_Trans);

unfortunately, the table creation takes over 30 minutes which is not acceptable for my application.

to boost the performance, I even partitioned the table based on "TransDate" and created index on the partitions. but the execution time does not change significantly by table partitioning either.

Is there anyway to speed up these calculations?

PS- looking at the execution plan, it seems the problem is the fact that for each percentile_cont calculation, the query needs to sort the records. Is it possible to somehow sort "Num_Trans" and "Sum_trans" columns once and then do percentile calculations?

Arash
  • 141
  • 10
  • The query you have has no `WHERE`, so it's going to result in a scan of the *entire* table/index. If that table has 100m+ rows, then the above is going to need scan all 100m rows in the index. After that, it's going to have to perform a (very) expensive `DISTINCT` operation. – Thom A Jan 25 '23 at 14:17
  • so there's no possibility to improve performance? even with partitioning? – Arash Jan 25 '23 at 14:21
  • If you're looking to do a big analytical query like this on a few columns, take a look at using a columnstore index. – dfundako Jan 25 '23 at 14:40

0 Answers0