Efficiently grouping similar numbers together

Question

Possible Duplicate:
1D Number Array Clustering

I have an array of numbers like [1, 20, 300, 45, 5, 60, 10, 270, 3]. What is an efficient algorithm for grouping these numbers together based on proximity? In this case I'd expect something like [1, 3, 5], [20, 45, 60] and [270, 300].

It seems to me the keyword you are looking for is Clustering: http://en.wikipedia.org/wiki/Cluster_analysis. In your particular case, I would start with a bottom-up hierarchical clustering approach: http://en.wikipedia.org/wiki/Hierarchical_clustering — Mathias, Dec 28 '12 at 23:51
possible duplicate of [1D Number Array Clustering](http://stackoverflow.com/questions/11513484/1d-number-array-clustering), [Number clustering/partitioning algorithm](http://stackoverflow.com/questions/8140036/number-clustering-partitioning-algorithm), [Cluster one-dimensional data optimally?](http://stackoverflow.com/questions/7869609/cluster-one-dimensional-data-optimally) and many more. **Use the search function, Vlad**! — Has QUIT--Anony-Mousse, Dec 29 '12 at 08:24
@Mathias no: Clustering is appropriate when you have multiple dimensions. When data is 1d, it can be ordered, and processed *much* more efficiently this way. Jenks natural breaks optimization is a good keyword, but there are very simple delta-based methods that work well. And most of these very good methods do not even scale to 2 dimensions. — Has QUIT--Anony-Mousse, Dec 29 '12 at 08:27

score 4 · Accepted Answer · edited May 23 '17 at 12:09

The hardest part of what you are asking is how to actually define proximity. What would you expect the output to be from [5,10,15,20]? Would it be the same groupings as for [500,1000,1500,2000]?

What about [1,2,3,5,7,8,9]? Should there be one group or three? (or two?).
What about [1,2,3,5,7,8,9,1075,4000]? Do 1075 and 4000 become grouped together? Does the groupings of the smaller numbers become changed by the larger numbers in the sample?

This question is something asked by an entire field of Machine Learning: Cluster Analysis Perhaps this related question will help?

I think what you want is K-means clustering (helpfully linked to in the related question), but you need to know how many groups you want to split your data into to use it.

For **1 dimensional data** there exist much more efficient methods. You should *not* be using a multivariate method such as k-means. Instead, sort the data set (in `O(n log n)`) and then look for optimal partitioning strategies suc h as natural breaks, maximum gaps, minimum kernel density estimation etc. **Sorting is key**. — Erich Schubert, Jan 01 '13 at 11:52

score 3 · Answer 2 · answered Dec 28 '12 at 23:51

3

This might be massive overkill, but you might want to look into hierarchical clustering algorithms. These algorithms group together values into a hierarchy, from which you can easily extract the best k clusters. Agglomerative clustering is probably the easiest of these approaches to implement, and from experience it tends to produce very good clusters.

Hope this helps!

answered Dec 28 '12 at 23:51

templatetypedef

362,284
104
897
1,065

Actually these algorithms are designed for multidimensional data. For sigle dimensional data, they compute the pairwise differences and that does not make a lot of sense when the data set can be sorted. – Has QUIT--Anony-Mousse Dec 29 '12 at 08:28

Efficiently grouping similar numbers together

2 Answers2

Linked