Splitting data into ranges based on value

Question

I've a range of data values. I want to split it into ranges, say 3 ranges, based on the value. For ex: assume my data values look like this

My end goal is to be able to cleverly determine the low-medium-high from this set - meaning, I need code to find out which will 2 good boundary values - for something like <2600, >=2600 to 3000, >=3000. Something like K-means clustering... The set of values will always change, and may be even in 10 or 100s, instead of 1000s here.

The attached image show the distribution of the data points across dates on the x axis. Visually it becomes clear how the data is distributed. I want to be able to do the same via SQL

NTILE in SQL breaks into equal buckets and hence does not fit my requirement. Any ideas ?

You need to define who you want to break up the data. Then you can ask about how to implement the method in SQL. — Gordon Linoff, Apr 29 '19 at 11:03
Thanks for the quick response, but not sure I understand, Gordon. The SQL code should do it, if that answers... I will feed the boundaries so determined to a visualization which will then show the 3 ranges in say 3 different colors. — codingIsCool, Apr 29 '19 at 11:12
You could simply calculate (max - min) * .33 and (max - min) * .66. Or something more complex percentile. But you have to be specific. — Salman A, Apr 29 '19 at 11:24
@codingIsCool . . . You have to define what "good" means for this purpose. That is subjective. — Gordon Linoff, Apr 29 '19 at 11:32

score 0 · Answer 1 · answered Apr 29 '19 at 21:10

I dont know if I understood what you want, but if you want to divide a set of data to 3 groups and find the ranges for each group I think the below will help using NTILE.

First I will add the data to a memory table to test with, as below:-

declare @dataSet table ([value] int)
insert into @dataSet values 
(1669 ),(2653 ),(2662 ),(2669 ),(2701 ),(2711 ),(2748 ),(2770 ),(2821 ),(2832 ),(2848 ),(2850 ),(2855 ),
(2859 ),(2868 ),(2893 ),(2908 ),(2922 ),(2939 ),(2968 ),(2992 ),(3005 ),(3025 ),(3026 ),(3031 ),(3033 ),
(3034 ),(3036 ),(3053 ),(3073 ),(3087 ),(3087 ),(3099 ),(3128 ),(3128 ),(3131 ),(3133 ),(3140 ),(3174 ),
(3182 ),(3182 ),(3189 ),(3204 ),(3204 ),(3214 ),(3219 ),(3224 ),(3225 ),(3236 ),(3244 ),(3272 ),(3295 ),
(3295 ),(3296 ),(3299 ),(3305 ),(3305 ),(3402 )

Then will try to get the ranges.

    ;with cte as (
    select NTILE(3) over (order by [value]) [group],[value] from @dataSet
    )
    select [group],min([value]) [LowerRange],max([value]) [HigherRange] From cte group by [group]

The grouping is to get the upper and lower range for each group

Result will be as below:-

group   LowerRange  HigherRange
=====   ==========  ==========
1       1669        2968
2       2992        3174
3       3182        3402

hope this helps.

HI Ali - thanks. NTILE breaks the set into 3 equal buckets. What I need is 3 clever buckets based on the data. For ex, in the sample dataset provided, it could be < 2650, >2650 and <3000 and > 3000. The breakup should clearly depict the concentration of data points. — codingIsCool, Apr 30 '19 at 05:05
In your question your mentioned that NTILE fits your requirements, this is why I went with it. I dont know if there is a build in function to get the concentration of data points. — Ali Al-Mosawi, Apr 30 '19 at 19:20
This may help you check it out, https://stackoverflow.com/questions/39230362/microsoft-sql-and-r-stored-procedure-and-k-means — Ali Al-Mosawi, Apr 30 '19 at 19:48
Thanks Ali. I realized the mistake in the text of my question regd the NTILE thing now. Sorry for the confusion.And the link you pasted last point me in the right direction. Wonder why I didn't think about that earlier. — codingIsCool, May 02 '19 at 05:24

Splitting data into ranges based on value

1 Answers1