0

I've a range of data values. I want to split it into ranges, say 3 ranges, based on the value. For ex: assume my data values look like this

1669
2653
2662
2669
2701
2711
2748
2770
2821
2832
2848
2850
2855
2859
2868
2893
2908
2922
2939
2968
2992
3005
3025
3026
3031
3033
3034
3036
3053
3073
3087
3087
3099
3128
3128
3131
3133
3140
3174
3182
3182
3189
3204
3204
3214
3219
3224
3225
3236
3244
3272
3295
3295
3296
3299
3305
3305
3402

My end goal is to be able to cleverly determine the low-medium-high from this set - meaning, I need code to find out which will 2 good boundary values - for something like <2600, >=2600 to 3000, >=3000. Something like K-means clustering... The set of values will always change, and may be even in 10 or 100s, instead of 1000s here.

The attached image show the distribution of the data points across dates on the x axis. Visually it becomes clear how the data is distributed. I want to be able to do the same via SQL

NTILE in SQL breaks into equal buckets and hence does not fit my requirement. Any ideas ?

codingIsCool
  • 64
  • 1
  • 9
  • 1
    You need to define who you want to break up the data. Then you can ask about how to implement the method in SQL. – Gordon Linoff Apr 29 '19 at 11:03
  • Thanks for the quick response, but not sure I understand, Gordon. The SQL code should do it, if that answers... I will feed the boundaries so determined to a visualization which will then show the 3 ranges in say 3 different colors. – codingIsCool Apr 29 '19 at 11:12
  • You could simply calculate (max - min) * .33 and (max - min) * .66. Or something more complex percentile. But you have to be specific. – Salman A Apr 29 '19 at 11:24
  • @codingIsCool . . . You have to define what "good" means for this purpose. That is subjective. – Gordon Linoff Apr 29 '19 at 11:32

1 Answers1

0

I dont know if I understood what you want, but if you want to divide a set of data to 3 groups and find the ranges for each group I think the below will help using NTILE.

First I will add the data to a memory table to test with, as below:-

declare @dataSet table ([value] int)
insert into @dataSet values 
(1669 ),(2653 ),(2662 ),(2669 ),(2701 ),(2711 ),(2748 ),(2770 ),(2821 ),(2832 ),(2848 ),(2850 ),(2855 ),
(2859 ),(2868 ),(2893 ),(2908 ),(2922 ),(2939 ),(2968 ),(2992 ),(3005 ),(3025 ),(3026 ),(3031 ),(3033 ),
(3034 ),(3036 ),(3053 ),(3073 ),(3087 ),(3087 ),(3099 ),(3128 ),(3128 ),(3131 ),(3133 ),(3140 ),(3174 ),
(3182 ),(3182 ),(3189 ),(3204 ),(3204 ),(3214 ),(3219 ),(3224 ),(3225 ),(3236 ),(3244 ),(3272 ),(3295 ),
(3295 ),(3296 ),(3299 ),(3305 ),(3305 ),(3402 )

Then will try to get the ranges.

    ;with cte as (
    select NTILE(3) over (order by [value]) [group],[value] from @dataSet
    )
    select [group],min([value]) [LowerRange],max([value]) [HigherRange] From cte group by [group]

The grouping is to get the upper and lower range for each group

Result will be as below:-

group   LowerRange  HigherRange
=====   ==========  ==========
1       1669        2968
2       2992        3174
3       3182        3402

hope this helps.

Ali Al-Mosawi
  • 783
  • 6
  • 12
  • HI Ali - thanks. NTILE breaks the set into 3 equal buckets. What I need is 3 clever buckets based on the data. For ex, in the sample dataset provided, it could be < 2650, >2650 and <3000 and > 3000. The breakup should clearly depict the concentration of data points. – codingIsCool Apr 30 '19 at 05:05
  • In your question your mentioned that NTILE fits your requirements, this is why I went with it. I dont know if there is a build in function to get the concentration of data points. – Ali Al-Mosawi Apr 30 '19 at 19:20
  • This may help you check it out, https://stackoverflow.com/questions/39230362/microsoft-sql-and-r-stored-procedure-and-k-means – Ali Al-Mosawi Apr 30 '19 at 19:48
  • Thanks Ali. I realized the mistake in the text of my question regd the NTILE thing now. Sorry for the confusion.And the link you pasted last point me in the right direction. Wonder why I didn't think about that earlier. – codingIsCool May 02 '19 at 05:24