I am trying to implement the k-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that swapping in a different distance function can be fatal and should be done carefully. Using cosine distance as the metric forces me to change the averaging function (the average consistent with cosine distance must be an element-by-element average of the normalized vectors).

I have seen an elegant solution that manually overrides sklearn's distance function, and I want to use the same technique to override the averaging part of the code, but I couldn't find where it happens.

Does anyone know how this can be done?
How critical is it that the distance metric doesn't satisfy the triangle inequality?
If anyone knows of a different efficient implementation of k-means that uses the cosine metric, or that accepts custom distance and averaging functions, that would also be really helpful.
Thank you very much!

Edit:
After switching to the angular distance instead of the cosine distance, the code looks something like this:

import numpy as np
from sklearn.cluster import k_means_
from sklearn.metrics.pairwise import cosine_similarity

def KMeans_cosine_fit(sparse_data, nclust=10, njobs=-1, randomstate=None):
    # Manually override the Euclidean distance function used inside KMeans
    def euc_dist(X, Y=None, Y_norm_squared=None, squared=False):
        # return pairwise_distances(X, Y, metric='cosine', n_jobs=10)
        # Angular distance: arccos of the cosine similarity, scaled to [0, 1]
        return np.arccos(cosine_similarity(X, Y)) / np.pi
    k_means_.euclidean_distances = euc_dist
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=njobs, random_state=randomstate)
    _ = kmeans.fit(sparse_data)
    return kmeans

I noticed (by working through the math) that if the vectors are normalized, the standard average works well for the angular metric. As far as I understand, I would have to change `_mini_batch_step()` in `k_means_.py`, but the function is pretty complicated and I couldn't figure out how to modify it.
Does anyone know of an alternative solution?
Or does anyone know how I can replace this function with one that always forces the centroids to be normalized?
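
For reference, here is a minimal from-scratch sketch of what I mean (the function name `spherical_kmeans` is my own; it assumes dense input and runs plain Lloyd iterations, re-normalizing each centroid after the averaging step):

import numpy as np

def spherical_kmeans(X, n_clusters=10, n_iter=100, seed=0):
    # Normalize every sample to unit length so dot products are cosine similarities
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity
        labels = (X @ centers.T).argmax(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:
                # Average the members, then push the centroid back onto the unit sphere
                mean = members.mean(axis=0)
                centers[k] = mean / np.linalg.norm(mean)
    return labels, centers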

ise372
  • Take a look at [k_means_.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py) in the scikit-learn source code. The cosine distance example you linked to is doing nothing more than replacing a function variable called `euclidean_distances` in the `k_means_` module with a custom-defined function. If you post your k-means code and what function you want to override, I can give you a more specific answer. But if you want to do it yourself, just look for the name of the averaging function in the `k_means_` source and replace it. – charlesreid1 Sep 25 '17 at 17:00
  • Also, in general SO questions should include a [minimal, complete, verifiable example](https://stackoverflow.com/help/mcve) - you can expect to get more help if you include the code you want to modify or what is not working. – charlesreid1 Sep 25 '17 at 17:02
  • @charlesreid1 Thank you, I added the code. My problem is that I haven't completely understood how the averaging function in `k_means_.py` works, and thus I couldn't understand how to change it. – ise372 Sep 26 '17 at 11:16
  • There is a Python package called [spherecluster](https://github.com/clara-labs/spherecluster) that implements the k-means algorithm on a sphere (so it does essentially the same thing as what you are attempting to do). – σηγ Sep 26 '17 at 17:50
  • try this https://gist.github.com/mblondel/6230787 – Cătălin George Feștilă Mar 15 '20 at 09:57
  • You can try k-medoids, which will support any distance metric. It doesn't use 'means' as centers, but existing data points. https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html – Bert Kellerman Apr 15 '20 at 06:06
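
For reference, a minimal sketch of the k-medoids suggestion above (it assumes the scikit-learn-extra package is installed and that `X` is your data matrix):

from sklearn_extra.cluster import KMedoids

# k-medoids accepts any pairwise metric; the centers are actual data
# points (medoids), so no averaging step needs to be redefined.
kmedoids = KMedoids(n_clusters=10, metric='cosine', random_state=0).fit(X)
labels = kmedoids.labels_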

3 Answers


So it turns out you can just normalise X to be of unit length and use k-means as normal. The reason: if $x_1$ and $x_2$ are unit vectors, then

$\|x_1 - x_2\|^2 = \|x_1\|^2 + \|x_2\|^2 - 2\,x_1^\top x_2 = 2(1 - \cos\theta)$

so the squared Euclidean distance is exactly twice the cosine distance, and minimising one minimises the other.
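
A quick numerical check of that identity (a self-contained sketch with random vectors):

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 5))
x1 /= np.linalg.norm(x1)  # normalize to unit length
x2 /= np.linalg.norm(x2)

squared_euclidean = np.sum((x1 - x2) ** 2)
cosine_distance = 1 - x1 @ x2
print(np.isclose(squared_euclidean, 2 * cosine_distance))  # True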

So in terms of using k-means, simply do:

import numpy as np
from sklearn.cluster import KMeans

# Normalize each row of X to unit length
length = np.sqrt((X**2).sum(axis=1))[:, None]
X = X / length

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

And if you need the centroids and the distance matrix, do:

# The fitted centroids are plain means, so they are generally shorter than
# unit length; re-normalize them before computing cosine distances.
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:, None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T)  # K x N matrix of cosine distances
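
Continuing from the snippet above, `dist` can then be used to assign points (including new, normalized points) to their nearest centroid by cosine distance:

# dist is K x N: row k holds the cosine distance from centroid k to every
# point, so the argmin over axis 0 gives each point's cluster.
labels_by_cosine = dist.argmin(axis=0)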

Notes:

  • Just realised that you are trying to minimise the distance between the mean vector of a cluster and its constituents. The mean vector has length less than one when you simply average unit vectors. In practice, though, it's still worth running the normal sklearn algorithm and checking the lengths of the mean vectors (see the one-liner below). In my case the mean vectors were close to unit length (averaging around 0.9, but this depends on how dense your data is). TL;DR: use the spherecluster package, as @σηγ pointed out.
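
A one-liner for that length check (assuming `kmeans` has been fitted as above):

import numpy as np

# Lengths of the raw cluster means; values close to 1.0 mean the
# normalization trick loses little on this data.
print(np.linalg.norm(kmeans.cluster_centers_, axis=1))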
sachinruk
  • 2
    Relevant discussion from our friends over on Cross Validated --> https://stats.stackexchange.com/a/146279/243511 – timhealz Apr 27 '20 at 20:46
  • If you use sklearn.feature_extraction.text.TfidfVectorizer, it seems that the L2 normalization is applied by default, i.e., the output of the vectorizer is already normalized. – tomas May 31 '21 at 13:04

You can normalize your data and then use KMeans. Note that `preprocessing.normalize` applies L2 normalization by default, so each row comes out with unit length.

from sklearn import preprocessing
from sklearn.cluster import KMeans

kmeans = KMeans().fit(preprocessing.normalize(X))
ricecooker

Unfortunately, no. sklearn's current implementation of k-means only uses Euclidean distances.

The reason is that k-means alternates between computing each cluster's center and assigning every sample to the closest center, and the arithmetic mean is the optimal center only under squared Euclidean distance.

If you want to use k-means with cosine distance, you need to make your own function or class. Alternatively, try a clustering algorithm that accepts an arbitrary distance metric, such as DBSCAN.
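
For example, DBSCAN takes a metric argument directly, so cosine distance works without any patching (a minimal sketch; `X` is your data matrix, and the eps value is only a placeholder that must be tuned):

from sklearn.cluster import DBSCAN

# eps is the maximum cosine distance between two neighbors; 0.3 is a
# placeholder and should be tuned for your data.
clustering = DBSCAN(eps=0.3, min_samples=5, metric='cosine').fit(X)
labels = clustering.labels_  # -1 marks noise points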

Gilseung Ahn