This is about centroid initialization in sklearn's KMeans. I want to initialize the centroids in a "linear" way: linear initialization spaces the centroids evenly between the [min, max] of the original data samples, where min is the minimum value among the data samples and max is the maximum value among them.
I think I can do this by passing a callable as the "init" argument of "sklearn.cluster.KMeans", but I don't know how.
I found the following code which imitates the "kmeans++" initialization method:
import numpy as np
from sklearn.datasets import make_blobs

# Generate a random dataset
# X = np.random.rand(5, 2)
X, y = make_blobs(n_samples=500, n_features=2, centers=3)
# 'X' - generated samples
# 'y' - integer labels for cluster membership of each sample
X.shape, y.shape
# ((500, 2), (500,))
np.min(X), np.max(X)
# (-8.23395492070187, 11.85003843624708)
def plus_plus(ds, k, random_state=42):
    """
    Create cluster centroids using the k-means++ algorithm.

    Parameters
    ----------
    ds : numpy array
        The dataset to be used for centroid initialization.
    k : int
        The desired number of clusters for which centroids are required.
    random_state : int
        Currently unused; the seeding line below is commented out.

    Returns
    -------
    centroids : numpy array
        Collection of k centroids as a numpy array.

    Inspiration from here: https://stackoverflow.com/questions/5466323/how-could-one-implement-the-k-means-algorithm
    """
    # np.random.seed(random_state)
    # Start from the first sample as the initial centroid
    centroids = [ds[0]]
    for _ in range(1, k):
        # Squared distance from each sample to its nearest chosen centroid
        dist_sq = np.array([min([np.inner(c - x, c - x) for c in centroids]) for x in ds])
        probs = dist_sq / dist_sq.sum()
        cumulative_probs = probs.cumsum()
        r = np.random.rand()
        # Fall back to the last sample if rounding leaves r >= cumulative_probs[-1]
        i = len(ds) - 1
        for j, p in enumerate(cumulative_probs):
            if r < p:
                i = j
                break
        centroids.append(ds[i])
    return np.array(centroids)
# Create cluster centroids using k-means++ algo-
centroids = plus_plus(X, 3)
centroids.shape
# (3, 2)
For this example, min = -8.23395492070187 and max = 11.85003843624708, so the centroids have to be linearly spaced (linearly initialized) between min and max.
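To make the question concrete, here is a rough sketch of what I have in mind. It is only my best guess, not something I know to be the right approach. The name `linear_init` is made up for illustration, and I am assuming that every feature of a centroid gets the same linearly spaced value, since min and max are taken over the whole array. As far as I understand the docs, a callable passed as `init` receives `X`, `n_clusters` and a random state and must return an array of shape (n_clusters, n_features).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def linear_init(X, n_clusters, random_state=None):
    """Hypothetical helper: space n_clusters centroids evenly between the
    global min and max of X. random_state is accepted only because KMeans
    passes it to the callable; it is not used here."""
    lo, hi = np.min(X), np.max(X)
    # Evenly spaced scalar values from lo to hi (both included)
    steps = np.linspace(lo, hi, n_clusters)
    # Assumption: repeat each value across all features,
    # giving shape (n_clusters, n_features)
    return np.tile(steps[:, None], (1, X.shape[1]))


X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

centroids = linear_init(X, 3)
print(centroids.shape)

# Option 1: pass the precomputed array as init
km_array = KMeans(n_clusters=3, init=centroids, n_init=1).fit(X)

# Option 2: pass the callable itself as init
km_callable = KMeans(n_clusters=3, init=linear_init, n_init=1).fit(X)
```

I set `n_init=1` because the initialization is deterministic, so repeated runs would start from the same centroids. I am not sure the two options are guaranteed to give identical results, since KMeans may preprocess `X` before calling the callable.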
Can you please help?
Thanks!