I found a module that does same-size constrained K-Means: the size-constrained-clustering package. It clusters the data into groups of equal size, using either a min-cost-flow formulation or a heuristic. Install it with

pip install size-constrained-clustering

or

pip install git+https://github.com/jingw2/size_constrained_clustering.git

and then use either the min-cost-flow variant or the heuristics variant:
import numpy as np
from size_constrained_clustering import equal

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)

model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
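To sanity-check the result you can count the label frequencies; with 2000 points and 3 clusters the group sizes can differ by at most one:

print(np.bincount(labels))  # e.g. [667 667 666]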
If you have trouble installing the size-constrained-clustering module itself, you can use the two classes below instead; they only require the k-means-constrained package:

pip install k-means-constrained
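If equal-size clusters are all you need, you can also call KMeansConstrained directly, without any wrapper; a minimal sketch for the same 2000-point example, with size_min/size_max chosen so the sizes differ by at most one:

import numpy as np
from k_means_constrained import KMeansConstrained

X = np.random.rand(2000, 2)
clf = KMeansConstrained(n_clusters=3, size_min=666, size_max=667, random_state=42)
labels = clf.fit_predict(X)  # each cluster ends up with 666-667 points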
The SameSizeKMeansMinCostFlow class:
from k_means_constrained import KMeansConstrained
from scipy.spatial.distance import cdist
import warnings

import base


class SameSizeKMeansMinCostFlow(base.Base):

    def __init__(self, n_clusters, max_iters=1000, distance_func=cdist, random_state=42):
        '''
        Args:
            n_clusters (int): number of clusters
            max_iters (int): maximum number of iterations
            distance_func (callable): function with input (X, centers); defaults to l2-distance (cdist)
            random_state (int): random seed, 42 by default
        '''
        super(SameSizeKMeansMinCostFlow, self).__init__(n_clusters, max_iters, distance_func)
        self.clf = None
        self.random_state = random_state

    def fit(self, X):
        n_samples, n_features = X.shape
        # Cluster sizes differ by at most one when n_samples is not divisible by n_clusters
        minsize = n_samples // self.n_clusters
        maxsize = (n_samples + self.n_clusters - 1) // self.n_clusters

        clf = KMeansConstrained(self.n_clusters, size_min=minsize,
                                size_max=maxsize, random_state=self.random_state)

        if minsize != maxsize:
            warnings.warn("Cluster minimum and maximum size are {} and {}, respectively".format(minsize, maxsize))

        clf.fit(X)

        self.clf = clf
        self.cluster_centers_ = self.clf.cluster_centers_
        self.labels_ = self.clf.labels_

    def predict(self, X):
        return self.clf.predict(X)
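With this class and base.py (below) saved next to each other, the first snippet works the same way; here I assume you saved the class in a file called same_size_kmeans.py (the file name is up to you):

import numpy as np
from same_size_kmeans import SameSizeKMeansMinCostFlow  # hypothetical file name

X = np.random.rand(2000, 2)
model = SameSizeKMeansMinCostFlow(3)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_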
The base class (base.py), which also contains the k-means++ seeding helper k_init:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''
@file: base.py, base for clustering algorithm
@Author: Jing Wang (jingw2@foxmail.com)
@Date: 06/07/2020
'''
from scipy.spatial.distance import cdist
import numpy as np
import warnings
import scipy.sparse as sp
import os
import sys

# Make sibling modules importable regardless of the working directory
path = os.path.dirname(os.path.abspath(__file__))
sys.path.append(path)

from k_means_constrained.sklearn_import.utils.extmath import stable_cumsum


class Base(object):

    def __init__(self, n_clusters, max_iters, distance_func=cdist):
        '''
        Base cluster object

        Args:
            n_clusters (int): number of clusters
            max_iters (int): maximum number of iterations
            distance_func (callable): distance function callback
        '''
        assert isinstance(n_clusters, int)
        assert n_clusters >= 1
        assert isinstance(max_iters, int)
        assert max_iters >= 1
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        if distance_func is not None and not callable(distance_func):
            raise Exception("Distance function is not callable")
        self.distance_func = distance_func

    def fit(self, X):
        pass

    def predict(self, X):
        pass
def k_init(X, n_clusters, x_squared_norms, random_state=42, distance_func=cdist, n_local_trials=None):
    """Init n_clusters seeds according to k-means++

    Parameters
    ----------
    X : array or sparse matrix, shape (n_samples, n_features)
        The data to pick seeds for. To avoid memory copy, the input data
        should be double precision (dtype=np.float64).
    n_clusters : integer
        The number of seeds to choose.
    x_squared_norms : array, shape (n_samples,)
        Squared Euclidean norm of each data point.
    random_state : int or RandomState instance
        The generator used to initialize the centers. Use an int to make
        the randomness deterministic.
        See :term:`Glossary <random_state>`.
    distance_func : callable
        Function with signature (A, B) returning pairwise distances. Note
        that the default, cdist, returns plain (not squared) Euclidean
        distances, so candidates are sampled proportionally to distance
        rather than squared distance as in canonical k-means++.
    n_local_trials : integer, optional
        The number of seeding trials for each center (except the first),
        of which the one reducing inertia the most is greedily chosen.
        Set to None to make the number of trials depend logarithmically
        on the number of seeds (2 + log(k)); this is the default.

    Notes
    -----
    Selects initial cluster centers for k-means clustering in a smart way
    to speed up convergence. See: Arthur, D. and Vassilvitskii, S.
    "k-means++: the advantages of careful seeding". ACM-SIAM Symposium
    on Discrete Algorithms. 2007.

    Version ported from http://www.stanford.edu/~darthur/kMeansppTest.zip,
    which is the implementation used in the aforementioned paper.
    """
    n_samples, n_features = X.shape
    centers = np.empty((n_clusters, n_features), dtype=X.dtype)

    assert x_squared_norms is not None, 'x_squared_norms None in _k_init'

    # Accept a plain int seed as well as a ready-made RandomState instance
    if isinstance(random_state, int):
        random_state = np.random.RandomState(random_state)

    # Set the number of local seeding trials if none is given
    if n_local_trials is None:
        # This is what Arthur/Vassilvitskii tried, but did not report
        # specific results for other than mentioning in the conclusion
        # that it helped.
        n_local_trials = 2 + int(np.log(n_clusters))

    # Pick the first center randomly
    center_id = random_state.randint(n_samples)
    if sp.issparse(X):
        centers[0] = X[center_id].toarray()
    else:
        centers[0] = X[center_id]

    # Initialize list of closest distances and calculate current potential
    closest_dist_sq = distance_func(centers[0, np.newaxis], X)
    current_pot = closest_dist_sq.sum()

    # Pick the remaining n_clusters - 1 points
    for c in range(1, n_clusters):
        # Choose center candidates by sampling with probability proportional
        # to the distance to the closest existing center
        rand_vals = random_state.random_sample(n_local_trials) * current_pot
        candidate_ids = np.searchsorted(stable_cumsum(closest_dist_sq),
                                        rand_vals)
        # Numerical imprecision can result in a candidate_id out of range
        np.clip(candidate_ids, None, closest_dist_sq.size - 1,
                out=candidate_ids)

        # Compute distances to the center candidates
        distance_to_candidates = distance_func(X[candidate_ids], X)

        # Update closest distances and potential for each candidate
        np.minimum(closest_dist_sq, distance_to_candidates,
                   out=distance_to_candidates)
        candidates_pot = distance_to_candidates.sum(axis=1)

        # Decide which candidate is the best
        best_candidate = np.argmin(candidates_pot)
        current_pot = candidates_pot[best_candidate]
        closest_dist_sq = distance_to_candidates[best_candidate]
        best_candidate = candidate_ids[best_candidate]

        # Permanently add the best center candidate found in local trials
        if sp.issparse(X):
            centers[c] = X[best_candidate].toarray()
        else:
            centers[c] = X[best_candidate]

    return centers
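For completeness, a quick sketch of calling k_init on its own; it assumes the function lives in base.py as above. A plain seed also works thanks to the int-to-RandomState conversion, but here I pass a RandomState instance explicitly since the function draws samples from it directly:

import numpy as np
from scipy.spatial.distance import cdist
from base import k_init

X = np.random.rand(2000, 2)
x_squared_norms = (X ** 2).sum(axis=1)  # precomputed squared norms
centers = k_init(X, n_clusters=3, x_squared_norms=x_squared_norms,
                 random_state=np.random.RandomState(42), distance_func=cdist)
print(centers.shape)  # (3, 2)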