I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have built the clusters, I want to evaluate their quality. I have been able to compute the within set sum of squared errors (WSSSE) for each value of K, but I was hoping to run a silhouette test. Could anyone share any relevant functions or packages for doing this in Scala?
1 Answer
Silhouette is not scalable. It uses pairwise distances, so it will always take O(n^2) time to compute.

Have you considered the Within Set Sum of Squared Errors already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means)? It can also help determine the number of clusters (see: Cluster analysis in R: determine the optimal number of clusters).
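If you do want a silhouette score on Spark 2.2.0 (which ships no built-in implementation; MLlib only added a silhouette-based `ClusteringEvaluator` in a later release), one workaround is to compute the metric by hand on a manageable random sample of the clustered data. Here is a minimal plain-Scala sketch of the silhouette coefficient itself; the object and method names (`SilhouetteSketch`, `meanSilhouette`) are my own for illustration, not from any library, and the O(n^2) loop is exactly why the answer warns it does not scale to millions of rows:

```scala
// Minimal silhouette sketch in plain Scala (no Spark) — for illustration,
// or for running on a small sample of a large clustered dataset.
// Assumes at least two clusters; O(n^2) pairwise distances.
object SilhouetteSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Mean silhouette coefficient: for each point i, a(i) is the mean distance
    * to its own cluster, b(i) the smallest mean distance to another cluster,
    * and s(i) = (b - a) / max(a, b). `labels(i)` is the cluster id of `points(i)`. */
  def meanSilhouette(points: Array[Point], labels: Array[Int]): Double = {
    val byCluster = points.indices.groupBy(i => labels(i))
    val scores = points.indices.map { i =>
      val own = byCluster(labels(i)).filter(_ != i)
      if (own.isEmpty) 0.0 // singleton cluster: silhouette conventionally 0
      else {
        val a = own.map(j => dist(points(i), points(j))).sum / own.size
        val b = byCluster.collect { case (c, js) if c != labels(i) =>
          js.map(j => dist(points(i), points(j))).sum / js.size
        }.min
        (b - a) / math.max(a, b)
      }
    }
    scores.sum / scores.size
  }

  def main(args: Array[String]): Unit = {
    val points: Array[Point] = Array(
      Array(0.0, 0.0), Array(0.0, 1.0),    // cluster 0: two nearby points
      Array(10.0, 10.0), Array(10.0, 11.0) // cluster 1: far from cluster 0
    )
    val labels = Array(0, 0, 1, 1)
    // Well-separated clusters score close to 1.
    println(f"mean silhouette = ${meanSilhouette(points, labels)}%.4f")
  }
}
```

Scores near 1 mean tight, well-separated clusters; near 0, overlapping clusters; negative values suggest misassigned points. Unlike WSSSE, the score is bounded in [-1, 1], so it is comparable across differently scaled (e.g. raw vs. normalized) versions of the data.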

Anush
- Yes, I have calculated and found a value of K that can be ideal on the basis of WSSSE, exactly as written in the link above, but I wanted to know something else: I have used the same value of K for the raw data as well as the normalized data, yet the WSSSE for the normalized data is way too high. – sayan sen Aug 22 '17 at 11:03
- So I was wondering whether I could check the silhouette. Is there any other way to 1. check the quality of the clusters and evaluate a value of K without using WSSSE, and 2. determine a way where the normalized as well as the raw data would exhibit the same WSSSE value while the value of K is unchanged? Any other relevant suggestions in this regard would be highly appreciated. I am clustering close to 5 million rows of data. – sayan sen Aug 22 '17 at 11:03