9

I have a set of data clustering into k groups, each cluster has a minimum size constraint of m

I've done some reclustering of the data. So now I got this set of points that each one has one or more better clusters to be in, but cannot be switched individually because it'll violate the size constraint.

Goal: minimize the sum of distance from each point to its cluster center.

Subject to: Minimum cluster size m

I want to find an algorithm to reassign all points without violating the constraint, while guaranteed to decrease the objective.

I thought of using Graph to represent pair wise relationship between points. But I'm not sure how to do the reassignment since it exists the possibility of a big dense cycle, and I got lost in swapping multiple points between multiple clusters.

I also created a list of cluster pairs with possible swapping candidates, but still couldn't find the way to optimal the objective.

I hope I explained my situation. I'm new to algorithm, and not familiar with the jargon and rules. If any other information needed, please let me know.

I've done a lot of research, I've tried out the algorithm in this paper, but without success, since the sum of membership degree doesn't necessarily correlate to cluster size. Clustering with Size Constraints

I've also read other similar posts on SO, but didn't find a detail algorithm I could implement.

I've tried to construct a weighted directed graph, with vertex representing clusters and edges from A to B represents points in cluster A willing to relocate to cluster B. and weight to be the sum of points

But with my data, it turns out all the nodes to be in a huge cycle with very dense edges. Because of my limited experience, I still could not figure out how to reassign among so many clusters. Any suggestions are appreciated!

Something like this.
enter image description here

qshng
  • 887
  • 1
  • 13
  • 32
  • 1
    Have a look at [this](http://stats.stackexchange.com/questions/5366/clustering-k-means-or-otherwise-with-a-minimum-cluster-size-constraint) over at CrossValidated. – Ami Tavory May 07 '15 at 22:03
  • I actually tried out the algorithm in the paper. Maybe I did something wrong. But somehow I didn't get the desired result. Since the sum of membership is not necessary related to the cluster size. For example u_{1} = [0.45, 0.15, 0.4], u_{2} = [0.45, 0.3, 0.25] and u_{3} = [0.1, 0.55, 0.35] the cluster is still unbalanced. – qshng May 07 '15 at 22:13
  • "size constraints" is too unspecific. Do you mean as in http://stackoverflow.com/questions/5452576/k-means-algorithm-variation-with-equal-cluster-size http://stackoverflow.com/questions/8796682/group-n-points-in-k-clusters-of-equal-size? – Has QUIT--Anony-Mousse May 07 '15 at 22:25

3 Answers3

4

To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating the minimum size constraint.
  2. Next, build a directed multigraph as follows:
    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  3. Looking for cycles in this graph will allow you to find moves (where a move consists of moving every vertex in the cycle).
  4. Pick the cycle with the highest total weight and recluster the nodes corresponding to the edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph will have a complexity of O(kn), where you have k and n total points, and can create the same number of multiedges. Tarjan's algorithm will have complexity of O(k2), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k4m2). That is quite extravagant; I'm sure there could be heuristics (and probably more detailed analysis) to improve the lower bound.

Kittsil
  • 2,349
  • 13
  • 22
  • Sorry, but I doubt this is correct. The existence of a cycle does not imply a potential move. Even if edges from both A to B and B to A exist, then moving a vertex from A to B might destroy the other edge. – Ami Tavory May 07 '15 at 22:28
  • @AmiTavory You are thinking of doing one operation and then the other. Instead, think of unclustering p in A and q in B, and then reclustering them (one to A and one to B); if p was closer to B and q was closer to A before the unclustering, the global distance will be minimized if p goes to B and q goes to A. You have to consider the entire cycle as a move, not the individual reclusterings. – Kittsil May 07 '15 at 22:51
  • Then imagine an isosceles triangle, with A and B forming the base, and C the apex. Furthermore, there is a cycle ABC. Furthermore, the AB edge is caused by a top-right element in A having an affinity to B. However, its move to B lowers A, so I don't see why it would follow that it doesn't destroy CA. – Ami Tavory May 07 '15 at 23:04
  • @AmiTavory It will certainly destroy CA. The global solution would be more reduced if you then did not move any point from C to A. However, this violates the hard constraint that A must have some minimum size, so you must pick the least harmful point in B or C and move it to A; _because there was previously a cycle,_ we know that this least damaging point is not the point that just left A. – Kittsil May 07 '15 at 23:12
  • Sorry, I don't think so. 1. It will not certainly destroy CA, only possibly. More importantly, 2. There is no reason to assume that any such transfer would necessarily violate the minimal size requirement of one of the clusters. – Ami Tavory May 07 '15 at 23:16
  • @AmiTavory I apologize for my lack of clarity. 1. I meant "There is certainly the possibility that it will destroy ...", not that it would be destroyed every single time. 2. I assumed that any node that could be reclustered without violating the minimum size constraints would have already been reclustered, as that is a much easier problem. These lacks of clarity do not detract from the correctness of the algorithm. – Kittsil May 07 '15 at 23:24
  • Perhaps you should update the answer to reflect your point 2? – Ami Tavory May 07 '15 at 23:36
  • Thanks @Kittsil and Ami! Sorry for my lack of clarity in the original post. I did reclustered the data already so that I get this set of points which can only be reassigned pair-wise. And I've tried to use the directed graph, but came across an issue that almost all vertex form a big cycle. And I'm not sure how to find the best way to recluster the points. I've also updated my original post. Thanks again for your time! – qshng May 08 '15 at 00:00
  • @qshngv Unfortunately, finding the cycles is only the first step. You must then pick one cycle (any cycle) and recluster the points corresponding to the cycle edges, and then iteratively repeat both steps until there are no more cycles. Again, this will not provide you with a globally optimum solution, but only with an optimal solution. – Kittsil May 08 '15 at 00:17
  • I have further questions about some scenarios, 1. Should the edges in the graph represent individual points, or it's weighted with all points adding up. 2. if there are multiples points or edges (in the case they represent individual point) in a cycle, which one should I select to remove? Does it matter if I pick the one reduce the global solution the most or not? 3. Same for the case that one point have multiple closer clusters, which cluster should it go in? Thanks! – qshng May 08 '15 at 01:46
  • Hi @Kittsil , I posted a follow-up question on this algorithm. http://stackoverflow.com/questions/30128334/removing-cycles-in-weighted-directed-graph Could you have a look at it please? Thanks! – qshng May 08 '15 at 16:13
  • @qshngv I did reclarified how edges are formed in this algorithm. I will address your individual cases in the other question. – Kittsil May 08 '15 at 17:17
  • @Kittsil oh sorry that I didn't understand it clearly at first. And thanks a lot for the clarification. I just found that I had a different understanding about the formation of the edges and the weight. I'll edit my other post accordingly. Thanks again for your time! – qshng May 08 '15 at 17:22
3

Try this: pip install k-means-constrained and then

from k_means_constrained import KMeansConstrained
KMeansConstrained(n_clusters=8, size_min=None, size_max=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)

sources:

https://pypi.org/project/k-means-constrained/

https://joshlk.github.io/k-means-constrained/

afsharov
  • 4,774
  • 2
  • 10
  • 27
  • You need to specify the minimum cluster size using the `size_min` parameter. For example if it the minim size is 10 and input data is X: `clf = KMeansConstrained(n_clusters=8, size_min=10); labels = clf.fit_predict(X)` – joshlk Dec 29 '20 at 23:32
2

This problem is addressed in this paper:

Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.

We propose explicitly adding $k$ constraints to the underlying clustering optimization problem requiring that cluster $h$ contain at least $\tau_h$ points.

I have an implementation of the algorithm in python.