I have a list like the following:

a = [9, 10, 11, 11, 12, 52, 49, 51, 50, 55, 51, 52, 54, 71, 72, 70, 69, 70, 110, 111, 113, 114]

As you can see, the numbers are clustered around several points. A cluster can occur anywhere: near 10, near 50, or even near 500; there is no regularity in that. However, every number in a cluster will always lie within ±5 of the cluster's mean. For example, the rounded mean of [9, 10, 11, 11, 12] is 11, so all the numbers in this cluster will be between 6 and 16.

I want to return a new list with the clustered numbers grouped into sublists, something like:

b = [[9, 10, 11, 11, 12], [49, 50, 51, 51, 52, 52, 54, 55],
     [69, 70, 70, 71, 72], [110, 111, 113, 114]]

Is there a way to do that?

Sourav
  • After a quick search, someone already asked for this :) [finding-clusters-of-numbers-in-a-list](https://stackoverflow.com/questions/15800895/finding-clusters-of-numbers-in-a-list) – LazyGoose Mar 02 '19 at 23:20
  • This isn't well defined. What do you do with `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]`? Which numbers do you take the mean over? – joel Mar 02 '19 at 23:31
  • @JoelBerkeley The list I will be getting won't have values like that. It will be more like [10,11,12,10,12,50,51,52,53,101,102,103]. The values are tightly clustered, and each cluster is far away from the others. – Sourav Mar 02 '19 at 23:37
  • Do you mean you can guarantee that there will be no clusters of numbers with difference more than 5 apart? And have you looked at the linked question? – joel Mar 02 '19 at 23:39
  • Yes, Joel. They won't be more than 5 apart. I have an image of bricks, each one stacked exactly on top of another. They are all the same shape and size. If I take the mean column number of each brick, the bricks in one stack must have relatively similar values. The stacks themselves are far from one another. – Sourav Mar 02 '19 at 23:44
  • I had an answer for this, but could not post because question got closed... – darksky Mar 02 '19 at 23:45
  • The gist of it is that you can use `sklearn`'s `MeanShift` for this task to get a pretty decent result depending on the choice of bandwidth. – darksky Mar 02 '19 at 23:47
  • Here's a question that demonstrates how this can be done https://stackoverflow.com/questions/18364026/clustering-values-by-their-proximity-in-python-machine-learning – darksky Mar 02 '19 at 23:48
  • @darksky The question has been marked as a duplicate. Is there any way to reopen it for an answer? – Sourav Mar 02 '19 at 23:49
  • I do not believe so, unless some mod will reopen which is highly unlikely. Try going to the link that I gave above and put `bandwidth = 10`. – darksky Mar 02 '19 at 23:52
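
Given the guarantee stated in the comments (values inside a cluster stay close together, and clusters are far apart), sorting the list and splitting wherever two neighbouring values differ by more than some threshold may already be enough. The following is only a minimal sketch of that idea, not the `MeanShift` approach mentioned above; the function name `cluster_by_gap` and the `max_gap=5` default are assumptions made for illustration.

```python
def cluster_by_gap(values, max_gap=5):
    """Group numbers into sorted sublists.

    A new sublist is started whenever the next sorted value is more than
    `max_gap` above the previous one. `max_gap` is a hypothetical
    parameter; 5 is an assumption based on the question's description.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            # Close enough to the last value of the current cluster.
            clusters[-1].append(v)
        else:
            # First value overall, or a gap larger than max_gap.
            clusters.append([v])
    return clusters


a = [9, 10, 11, 11, 12, 52, 49, 51, 50, 55, 51, 52, 54,
     71, 72, 70, 69, 70, 110, 111, 113, 114]
b = cluster_by_gap(a)
print(b)
# [[9, 10, 11, 11, 12], [49, 50, 51, 51, 52, 52, 54, 55],
#  [69, 70, 70, 71, 72], [110, 111, 113, 114]]
```

This only works when the gap between clusters is always larger than the gap between neighbours inside a cluster. For data like `[0, 1, 2, ..., 11]` it would return a single group, so a density-based method such as `MeanShift` may be a better fit if that guarantee does not hold.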

0 Answers