This is about centroid initialization in sklearn's KMeans. I want to initialize the centroids in a "linear" way: linear initialization spaces the centroids evenly between the [min, max] of the original data samples, where min is the minimum value among the data samples and max is the maximum value among them.

I think I can do this by passing a callable to "sklearn.cluster.KMeans", but I don't know how.

I found the following code which imitates the "kmeans++" initialization method:

import numpy as np
from sklearn.datasets import make_blobs

# Initialize random dataset-
# X = np.random.rand(5, 2)
X, y = make_blobs(n_samples=500, n_features=2, centers=3)

# 'X' - generated samples
# 'y' - integer labels for cluster membership of each sample

X.shape, y.shape
# ((500, 2), (500,))

np.min(X), np.max(X)
# (-8.23395492070187, 11.85003843624708)


def plus_plus(ds, k, random_state=42):
    """
    Create cluster centroids using the k-means++ algorithm.

    Parameters
    ----------
    ds : numpy array
        The dataset to be used for centroid initialization.

    k : int
        The desired number of clusters for which centroids are required.

    Returns
    -------
    centroids : numpy array
                Collection of k centroids as a numpy array.

    Inspiration from here: https://stackoverflow.com/questions/5466323/how-could-one-implement-the-k-means-algorithm
    """

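    # Seeded generator so that random_state actually controls the draws
    rng = np.random.default_rng(random_state)
    # Start from the first sample in the dataset
    centroids = [ds[0]]

    for _ in range(1, k):
        # Squared distance from each sample to its nearest chosen centroid
        dist_sq = np.array([min(np.inner(c - x, c - x) for c in centroids) for x in ds])
        probs = dist_sq / dist_sq.sum()
        cumulative_probs = probs.cumsum()
        r = rng.random()

        # First index whose cumulative probability exceeds r; the clip guards
        # against round-off leaving the last cumulative value just below r
        i = min(np.searchsorted(cumulative_probs, r, side="right"), len(ds) - 1)

        centroids.append(ds[i])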

    return np.array(centroids)




# Create cluster centroids using k-means++ algo-
centroids = plus_plus(X, 3)

centroids.shape
# (3, 2)

For this code example, "min" = -8.23395492070187 and "max" = 11.85003843624708, so the centroids have to be linearly spaced (linearly initialized) between [min, max].
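To make the target concrete, here is a minimal sketch of one possible reading of "linear" initialization. It assumes that [min, max] is split into k equal bins and that each centroid sits at a bin midpoint, with the same value repeated for every feature because the question uses a single global min and max. The function name `linear_centroids` is hypothetical, and the tiny array is only there to make the numbers easy to check by hand.

```python
import numpy as np


def linear_centroids(X, k):
    """Hypothetical helper: k centroids at the midpoints of k equal bins over [min, max]."""
    lo, hi = X.min(), X.max()              # global min and max over all values in X
    edges = np.linspace(lo, hi, k + 1)     # k + 1 bin edges give k equal-width bins
    centers = (edges[:-1] + edges[1:]) / 2.0
    # Assumption: every feature gets the same coordinate, since min/max are global scalars
    return np.repeat(centers[:, None], X.shape[1], axis=1)


# Small hand-checkable example: min = 0, max = 6, so the bin edges are 0, 2, 4, 6
X_small = np.array([[0.0, 6.0], [3.0, 1.0], [2.0, 4.0]])
centroids = linear_centroids(X_small, 3)
print(centroids.shape)  # (3, 2)
```

If the endpoints themselves should be centroids instead, `np.linspace(lo, hi, k)` would be the alternative; which of the two is meant depends on the definition of "linear" being followed.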

Can you please help?

Thanks!

Arun
  • by linearly do you mean you need to split (min, max) into k sectors and centroids to be in the middle of each sector? – hellpanderr Jun 13 '20 at 19:30
  • ie. for centers to be somewhere in `[-4.900621587368535, 1.7660450792981313, 8.432711745964799]` – hellpanderr Jun 13 '20 at 19:35
  • @hellpanderr yeah, dividing the range of values in the given sample say 'X' between min & max into 'k' bins and centroids to be in the center of them. – Arun Jun 13 '20 at 21:14
  • you can try passing an array of shape (n_clusters, n_features) with centroids as `init` parameter – hellpanderr Jun 13 '20 at 22:08
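
The suggestion in the last comment could look roughly like the sketch below. It assumes the bin-midpoint definition agreed on in the comments and a single global min and max, as in the question. The helper name `linear_init` is hypothetical; `init` accepting an array of shape (n_clusters, n_features) is documented KMeans behaviour. `n_init=1` is set explicitly because the starting centroids are fixed, so repeated restarts would add nothing.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def linear_init(X, n_clusters):
    """Hypothetical helper: centroids at the midpoints of equal bins over [min, max]."""
    lo, hi = np.min(X), np.max(X)
    edges = np.linspace(lo, hi, n_clusters + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    # Same coordinate for every feature, because min/max are global scalars here
    return np.repeat(centers[:, None], X.shape[1], axis=1)


X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

init_centroids = linear_init(X, 3)   # shape (n_clusters, n_features)
km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)

print(init_centroids.shape)          # (3, 2)
print(km.cluster_centers_.shape)     # (3, 2)
```

Passing a precomputed array keeps the starting points exactly as computed. The KMeans documentation also describes a callable form of `init` that takes X, n_clusters and a random state, but the array route avoids depending on how the data is preprocessed before that callable is invoked.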
