Normalising Data to use Cosine Distance in Kmeans (Python)

Question

I am currently solving a problem where I have to use cosine distance as the similarity measure for KMeans clustering. However, the standard KMeans implementation (from the sklearn package) uses Euclidean distance and does not allow you to change this.

Therefore, it is my understanding that if I normalise my original dataset with the code below, I can then run KMeans (using Euclidean distance) and the result will be the same as if I had changed the distance metric to cosine distance. Is that right?

from sklearn import cluster, preprocessing

# L2-normalise each row of the existing X
X_Norm = preprocessing.normalize(X)

km2 = cluster.KMeans(n_clusters=5, init='random').fit(X_Norm)

Please let me know if my mathematical understanding of this is incorrect.
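For reference, the relationship this idea relies on can be checked numerically: for L2-normalised rows, the squared Euclidean distance between two points equals twice their cosine distance. The snippet below is only a sketch of that check; the random matrix `X` is stand-in data, not the dataset from the question.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # assumption: random stand-in data

# L2-normalise each row so every point lies on the unit sphere
X_norm = normalize(X)

# Squared Euclidean distances between the normalised rows
sq_euclid = euclidean_distances(X_norm, squared=True)

# Cosine distances (1 - cosine similarity) between the original rows
cos_dist = cosine_distances(X)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = np.allclose(sq_euclid, 2.0 * cos_dist, atol=1e-8)
print(identity_holds)
```

This only shows that point-to-point distances agree up to a constant factor. KMeans still averages the normalised points, and such a mean is generally not unit length, so the clustering is not guaranteed to match one that uses cosine distance throughout.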

[This](https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance) may be of interest to you. Based on it, I suspect that what you are suggesting isn't quite right, especially considering that KMeans is sensitive to scaling. However, I'm not sure how to solve the problem... — WhoIsJack, Aug 21 '17 at 12:11
Thanks very much, that's the one I found earlier, but so far I don't have a better solution. — MSalty, Aug 21 '17 at 12:54
Have a look at [this](https://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikit-learn-k-means). The most important points: (1) what you want to do can apparently be done with the `nltk` package or with a [`kernel kmeans`](https://gist.github.com/mblondel/6230787) approach that is in the process of being added to sklearn, ... — WhoIsJack, Aug 21 '17 at 13:57
... (2) what you want to do should probably not be done because the mean is not a good estimation for cluster centers in non-Euclidean space (see comment by Anony-Mousse), (3) a better alternative may be `kmedoids` (`PAM`), for which unfortunately there is no sklearn implementation AFAIK, but there is one [here](https://github.com/letiantian/kmedoids) that accepts distance matrices, so you could use that with cosine distance. — WhoIsJack, Aug 21 '17 at 13:57
Thank you for your help @WhoIsJack that is exactly what I am after. — MSalty, Aug 22 '17 at 11:11
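As a rough illustration of the k-medoids suggestion in the comments above, the sketch below clusters on a precomputed cosine distance matrix. The `k_medoids` function is a hypothetical helper written for this example (a simple alternating scheme, not full PAM and not the API of the linked repository), and `X` is random stand-in data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances


def k_medoids(D, k, n_iter=100, seed=0):
    """Hypothetical helper: alternating k-medoids on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random as medoids
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance to its cluster
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels


rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))  # assumption: random stand-in data

D = cosine_distances(X)  # pairwise cosine distance matrix
medoids, labels = k_medoids(D, k=5)
```

Because each cluster centre is an actual data point, no mean is ever computed, which sidesteps the concern about means in a non-Euclidean space.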
