109

So let's say I have an array like this:

[1,1,2,3,10,11,13,67,71]

Is there a convenient way to partition the array into something like this?

[[1,1,2,3],[10,11,13],[67,71]]

I looked through similar questions, but most people suggested using k-means (e.g. via scipy) to cluster the points, which is quite confusing for a beginner like me. Also, I think k-means is more suitable for clustering in two or more dimensions, right? Is there any way to partition an array of N numbers into many partitions/clusters depending on the numbers?

Some people also suggest rigid range partitioning, but it doesn't always produce the expected results.

Has QUIT--Anony-Mousse
E.H.

6 Answers

153

Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.

In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.

You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is perhaps the most sound method for clustering 1-dimensional data.

With KDE, it again becomes obvious that 1-dimensional data is much better behaved. In 1D you have local minima, but in 2D you may have saddle points and other ambiguous "maybe" splitting points. See this Wikipedia illustration of a saddle point for an example of how such a point may or may not be appropriate for splitting clusters.

See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is the log-likelihood of the density):

KDE with Python
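
A minimal sketch of the same approach with SciPy (not the linked answer's exact code) might look like the following; the bandwidth factor (bw_method=0.1) is a hypothetical choice here, and the resulting splits depend strongly on it:

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

data = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71], dtype=float)

# Fit a 1D KDE. The bandwidth factor is an assumption: too wide and nearby
# groups merge, too narrow and every point becomes its own mode.
kde = gaussian_kde(data, bw_method=0.1)
grid = np.linspace(data.min(), data.max(), 2000)
density = kde(grid)

# Local minima of the estimated density are the candidate split points.
cuts = grid[argrelextrema(density, np.less)[0]]

# Cut the sorted data at each density minimum.
clusters, rest = [], np.sort(data)
for cut in cuts:
    clusters.append(rest[rest < cut].tolist())
    rest = rest[rest >= cut]
clusters.append(rest.tolist())
print(clusters)
# with this bandwidth, expect [[1.0, 1.0, 2.0, 3.0], [10.0, 11.0, 13.0], [67.0, 71.0]]

With a wider bandwidth the first two groups merge into one; that sensitivity is the usual tuning knob of KDE-based splitting.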

Has QUIT--Anony-Mousse

  • Implementation here: http://www.macwright.org/2013/02/18/literate-jenks.html – Tirno Jan 17 '14 at 23:10
  • Could you update your answer with why ```meanshift``` or ```dbscan``` may or may not be good approaches to clustering 1D? See http://scikit-learn.org/stable/modules/clustering.html – opyate Mar 04 '15 at 14:35
  • Essentially, both are very *naive* approximations to Kernel Density Estimation. Mean-Shift is a mode-seeking approach for multivariate KDE, and DBSCAN uses the most primitive KDE (a box kernel) to define what is dense and what is not. There is zero benefit to using them *on 1-dimensional data*. – Has QUIT--Anony-Mousse Mar 04 '15 at 19:46
  • Ckmeans.1d.dp (k-means adapted for 1-dimensional clustering) is worth a look however. See https://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf – skoush Aug 27 '16 at 17:28
  • @skoush That is a *slower* k-means variant that yields the global optimum (in 1D only). But if the SSQ k-means objective doesn't solve your problem, it does not matter whether you find a 0.1% better (by SSQ) k-means solution than with the faster standard algorithm. – Has QUIT--Anony-Mousse Aug 27 '16 at 17:35
  • Is there a Java implementation available? –  Jul 25 '21 at 14:54
  • Your interpretation of clustering is very particular and narrow. Not all 1D data sets have distributions like the one you've plotted. Your suggested methods will struggle to cluster heavy-tailed data sets, e.g. scree plots of eigenvalue decompositions. – Ash Mar 16 '23 at 15:03
13

This simple algorithm works:

points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]

clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    # start a new cluster whenever the gap to the previous sorted point exceeds eps
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)

The above example clusters points into groups such that each element in a group is at most eps away from its predecessor in sorted order; a consecutive gap larger than eps starts a new group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1D data allows you to solve the problem directly, instead of reaching for bigger guns like DBSCAN.

For the small datasets (<1000 elements) I tested, the above algorithm is 10-100x faster.
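
For comparison, here is a sketch of the equivalent call with scikit-learn's DBSCAN (assuming scikit-learn is installed); it should produce the same grouping as the loop above:

import numpy as np
from sklearn.cluster import DBSCAN

points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]

# DBSCAN expects a 2D array, so reshape the 1D data into a single column.
X = np.array(points).reshape(-1, 1)
labels = DBSCAN(eps=0.2, min_samples=1).fit_predict(X)

# Group the original points by cluster label.
groups = {}
for p, label in zip(points, labels):
    groups.setdefault(label, []).append(p)
print(list(groups.values()))
# expected: [[0.1], [0.31, 0.32, 0.45, 0.35, 0.4, 0.5]]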

tyrex
4

You may want to look at discretization algorithms. The 1D discretization problem is very similar to what you are asking: such algorithms decide cut-off points according to frequency, binning strategy, etc.

Weka uses the following algorithms in its discretization process:

weka.filters.supervised.attribute.Discretize

uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion

weka.filters.unsupervised.attribute.Discretize

uses simple binning
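
As a rough illustration of unsupervised equal-width binning (a plain NumPy sketch, not Weka's actual implementation):

import numpy as np

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])

# Three equal-width bins over the data range; np.digitize assigns each value
# to a bin based on the interior edges.
edges = np.linspace(nums.min(), nums.max(), 4)      # 3 bins -> 4 edges
bin_index = np.digitize(nums, edges[1:-1])
groups = [nums[bin_index == b].tolist() for b in range(3)]
print(groups)
# [[1, 1, 2, 3, 10, 11, 13], [], [67, 71]]

Note how equal-width bins lump the first two natural groups together and leave one bin empty; that is exactly the drawback of rigid range partitioning mentioned in the question, which frequency-based or supervised discretization tries to avoid.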

Atilla Ozgur
3

CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.

Example Usage

pip install ckwrap

import numpy as np
import ckwrap

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
km = ckwrap.ckmeans(nums, 3)

print(km.labels)
# [0 0 0 0 1 1 1 2 2]

buckets = [[], [], []]
for i in range(len(nums)):
    buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]

I expect the authors intended you to make use of the nd array functionality rather than create a list of lists.
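
For example, assuming km.labels is a NumPy array (the printed form above suggests it is) and using km.k from the measures listed below, the groups can be built with boolean indexing instead of an explicit loop:

labels = np.asarray(km.labels)   # assumption: labels behaves like a NumPy array
buckets = [nums[labels == c].tolist() for c in range(km.k)]
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]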

other measures:

km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss

The underlying algorithm is based on this article.

Ian Campbell
1

A late response, just for the record: you can partition a 1D array using Ckmeans.1d.dp.

This method guarantees optimality and is O(n^2), where n is the number of observations. The implementation is in C++ and there is a wrapper in R.
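
For readers who want to stay in Python, here is a plain dynamic-programming sketch of the same idea (optimal 1D k-means by total within-cluster sum of squares). It is not the actual Ckmeans.1d.dp code, just an O(k*n^2) illustration:

import numpy as np

def optimal_1d_kmeans(values, k):
    # Dynamic program for optimal 1D k-means (minimum total within-cluster
    # sum of squares), the same idea as Ckmeans.1d.dp, but a plain Python
    # sketch rather than the optimized C++ implementation.
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    ps = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums
    ps2 = np.concatenate([[0.0], np.cumsum(x * x)])   # prefix sums of squares

    def ssq(j, i):
        # Sum of squared deviations of x[j..i] (0-based, inclusive) from its mean.
        s, s2, m = ps[i + 1] - ps[j], ps2[i + 1] - ps2[j], i - j + 1
        return s2 - s * s / m

    INF = float("inf")
    best = [[INF] * (k + 1) for _ in range(n + 1)]    # best[i][m]: first i points, m clusters
    back = [[0] * (k + 1) for _ in range(n + 1)]      # where the last cluster starts
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for i in range(m, n + 1):
            for j in range(m, i + 1):                 # last cluster covers x[j-1 .. i-1]
                c = best[j - 1][m - 1] + ssq(j - 1, i - 1)
                if c < best[i][m]:
                    best[i][m], back[i][m] = c, j
    clusters, i = [], n                               # walk back through the split points
    for m in range(k, 0, -1):
        j = back[i][m]
        clusters.append(x[j - 1:i].tolist())
        i = j - 1
    return clusters[::-1]

print(optimal_1d_kmeans([1, 1, 2, 3, 10, 11, 13, 67, 71], 3))
# should print [[1.0, 1.0, 2.0, 3.0], [10.0, 11.0, 13.0], [67.0, 71.0]]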

1

This is the code for Has QUIT--Anony-Mousse's answer to Clustering values by their proximity in python (machine learning?):

When you have 1-dimensional data, sort it, and look for the largest gaps

I only added the requirement that the gaps be relatively large (each gap is measured relative to the value before it):

import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
# each gap divided by the value before it
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
# local maxima of the relative gaps are the split points
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')

output:

0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]
e271p314
  • I prefer iterations over list elements to indexes (e.g. `[y-x for x, y in zip(lst[:-1], lst[1:])]`), but algorithmically a very pragmatic answer. In particular, the complexity is the same as for sort, so `O(n log(n))`. – d4tm4x Mar 28 '23 at 10:30