8

I have a scenario where i'm splitting a number of results into quartilies using the SQL Server NTILE function below. The goal is to have an as equal number of rows in each class

case NTILE(4) over (order by t2.TotalStd) 
   when 1 then 'A' when 2 then 'B' when 3 then 'C' else 'D' end as Class

The result table is shown below and there is a (9,9,8,8) split between the 4 class groups A,B,C and D.

There are two results which cause me an issue, both rows have a same total std value of 30 but are assigned to different quartiles.

8   30  A
2   30  B

I'm wondering is there a way to ensure that rows with the same value are assigned to the same quartile? Can i group or partition by another column to get this behaviour?

Pos TotalStd    class
1   16  A
2   23  A
3   21  A
4   29  A
5   25  A
6   26  A
7   28  A
8   30  A
9   29  A
1   31  B
2   30  B
3   32  B
4   32  B
5   34  B
6   32  B
7   34  B
8   32  B
9   33  B
1   36  C
2   35  C
3   35  C
4   35  C
5   40  C
6   38  C
7   41  C
8   43  C
1   43  D
2   48  D
3   45  D
4   47  D
5   44  D
6   48  D
7   46  D
8   57  D
emeraldjava
  • 10,894
  • 26
  • 97
  • 170
  • 2
    This is how `NTILE` works, what happens if you all your base have the same `TotalStd` and want to do quartiles?, it will split the values on 4 different groups anyway – Lamak Feb 17 '12 at 16:26

4 Answers4

9

You will need to re create the Ntile function, using the rank function. The rank function gives the same rank for rows with the same value. The value later 'jumps' to the next rank as if you used row_number. We can use this behavior to mimic the Ntile function, forcing it to give the same Ntile value to rows with the same value. However - this will cause the Ntile partitions to be with a different size. See the example below for the new Ntile using 4 bins:

declare @data table ( x int )

insert @data values 
(1),(2),
(2),(3),
(3),(4),
(4),(5)

select  
    x,
    1+(rank() over (order by x)-1) * 4 / count(1) over (partition by (select 1)) as new_ntile
from @data

Results:

x   new_ntile
---------------
1   1
2   1
2   1
3   2
3   2
4   3
4   3
5   4
Roee Anuar
  • 3,071
  • 1
  • 19
  • 33
  • can you please explain how it works? and what is `partition by (select 1)` supposed to do? – Przemyslaw Remin Aug 24 '17 at 19:41
  • The "over partition by" clause runs the "count(1)" aggregation over each value in the "partition by" column. Assume you have an aggregated sales table with 3 fields - store, weekday, sold_amount. You want to know the share of sales for each store in each weekday. You can use this query: "Select store, weekday, sold_amount / sum(sold_amount) over (partition by weekday)". If you don't have a weekday, to run the over partition by you need to indicate a partitioning column. Since there is none, you use this trick : "select store, sold_amount / sum(sold_amount) over (partition by (select 1)) – Roee Anuar Aug 27 '17 at 06:02
  • Do we have to use the same `partition by` on both sides of division? I mean: `1+(rank() over (partition by store, weekday order by sales)-1) * 4 / count(1) over (partition by store, weekday)` – Przemyslaw Remin Aug 28 '17 at 12:21
  • If you want to run the Ntile inside these groups - then yes. – Roee Anuar Aug 28 '17 at 14:06
  • I think your solution does not guarantee that the same value of sales will not jump into two different tiles. I have, say, half of sales cases with the same value and the other sales cases diverse. So I expected the first half of sales cases to jump into the first tile. And it is not so. Or I am doing something wrong. – Przemyslaw Remin Aug 29 '17 at 07:13
  • 1
    As long as the Rank function gives the same rank for same values - they are guaranteed to be in the same bin. Note that some bins could remain empty in this case. You can send me your code and data to for inspection. – Roee Anuar Aug 29 '17 at 08:08
  • I would add for people looking to partion further more this ranking : 1+(dense_rank() over (partition by A ORDER BY B DESC)-1) * 4 / (dense_rank() over (partition by A ORDER BY B DESC) + dense_rank() over (partition by A ORDER BY B) - 1) as new_ntile. Based on [this answer on how to count distinct](https://stackoverflow.com/a/22347502/4346121) – SJGD Nov 11 '20 at 06:20
3

Not sure what you're expecting to happen here, really. SQL Server has divided up the data into 4 groups of as-equal-size-as-possible, as you asked. What do you want to happen? Have a look at this example:

declare @data table ( x int )

insert @data values 
(1),(2),
(2),(3),
(3),(4),
(4),(5)

select  
    x,
    NTILE(4) over (order by x) as ntile
from @data

Results:

x           ntile
----------- ----------
1           1
2           1
2           2
3           2
3           3
4           3
4           4
5           4

Now every ntile group shares a value with the one(s) next to it! But what else should it do?

AakashM
  • 62,551
  • 17
  • 151
  • 186
2

Try this:

; with a as (
                select TotalStd,Class=case ntile(4)over( order by TotalStd )
                                when 1 then 'A'
                                when 2 then 'B'
                                when 3 then 'C'
                                when 4 then 'D'
                                end
                from t2
                group by TotalStd
)
select d.*, a.Class from t2 d
inner join a on a.TotalStd=d.TotalStd
order by Class,Pos;
John Dewey
  • 6,985
  • 3
  • 22
  • 26
1

Here we have a table of 34 rows.

DECLARE @x TABLE (TotalStd INT) 
INSERT @x (TotalStd) VALUES (16), (21), (23), (25), (26), (28), (29), (29), (30), (30), (31), (32), (32), (32), (32), (33), (34), 
    (34), (35), (35), (35), (36), (38), (40), (41), (43), (43), (44), (45), (46), (47), (48), (48), (57)
SELECT '@x', TotalStd FROM @x ORDER BY TotalStd

We want to divide into quartiles. If we use NTILE, the bucket sizes will be roughly the same size (8 to 9 rows each) but ties are broken arbitrarily:

SELECT '@x with NTILE', TotalStd, NTILE(4) OVER (ORDER BY TotalStd) quantile FROM @x

See how 30 appears twice: once in quantile 1 and once in quantile 2. Similarly, 43 appears both in quantiles 3 and 4.

What I ought to find is 10 items in quantile 1, 8 in quantile 2, 7 in quantile 3 and 9 in quantile 4 (i.e. not a perfect 9-8-9-8 split, but such a split is impossible if we are not allowed to break ties arbitrarily). I can do it using NTILE to determine cutoff points in a temporary table:

DECLARE @cutoffs TABLE (quantile INT, min_value INT, max_value INT)

INSERT @cutoffs (quantile, min_value)
SELECT y.quantile, MIN(y.TotalStd)
FROM (SELECT TotalStd, NTILE(4) OVER (ORDER BY TotalStd) AS quantile FROM @x) y
GROUP BY y.quantile

-- The max values are the minimum values of the next quintiles
UPDATE c1 SET c1.max_value = ISNULL(C2.min_value, (SELECT MAX(TotalStd) + 1 FROM @x))
FROM @cutoffs c1 LEFT OUTER JOIN @cutoffs c2 ON c2.quantile - 1 = c1.quantile

SELECT '@cutoffs', * FROM @cutoffs

We'll use the the boundary values in the @cutoffs table to create the final table:

SELECT x.TotalStd, c.quantile FROM @x x 
    INNER JOIN @cutoffs c ON x.TotalStd >= c.min_value AND x.TotalStd < c.max_value
lebelinoz
  • 4,890
  • 10
  • 33
  • 56