
I have a Spark ML pipeline in pyspark that looks like this:

from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import StandardScaler, PCA

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(k=3, inputCol=scaler.getOutputCol(), outputCol="pca_output")  # k is required before fitting
kmeans = clustering.KMeans(featuresCol=pca.getOutputCol(), seed=2014)

pipeline = Pipeline(stages=[scaler, pca, kmeans])

After training the model, I want to get the silhouette coefficient for each sample, just like sklearn's silhouette_samples function.

I know that I can use ClusteringEvaluator to generate a score for the whole dataset, but I want a score for each sample instead.
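
For reference, the whole-dataset score I mean comes from something like the sketch below (column names follow the pipeline above; df is the training DataFrame and is assumed):

from pyspark.ml.evaluation import ClusteringEvaluator

model = pipeline.fit(df)
predictions = model.transform(df)

evaluator = ClusteringEvaluator(featuresCol="pca_output",
                                predictionCol="prediction",
                                metricName="silhouette")
print(evaluator.evaluate(predictions))   # a single number for the whole dataset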

How can I achieve this efficiently in pyspark?

Sreeram TP

1 Answer


This has been explored before on Stack Overflow. What I would change about, and add to, that answer is that you can use LSH (locality-sensitive hashing), which ships with Spark. It essentially does blind clustering with a reduced set of dimensions: it cuts down the number of comparisons and lets you specify a 'boundary' (density limit) for your clusters. It can be a good tool for enforcing the level of density you are interested in. You could run KMeans first and use the centroids as input to the approximate similarity join, or vice versa, to help you pick the number of KMeans points to look at.
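
As a rough sketch of what that looks like with Spark's BucketedRandomProjectionLSH (the Euclidean-distance LSH): the bucketLength, numHashTables and distance threshold below are illustrative values you would tune for your data, and transformed stands for the output of the fitted pipeline.

from pyspark.ml.feature import BucketedRandomProjectionLSH

lsh = BucketedRandomProjectionLSH(inputCol="pca_output", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
lsh_model = lsh.fit(transformed)

# Self-join the dataset, keeping only pairs within the distance threshold --
# that threshold is the 'boundary' (density limit) mentioned above.
pairs = lsh_model.approxSimilarityJoin(transformed, transformed, 5.0, distCol="dist")

# Or query the neighbourhood around a single KMeans centroid (as a Vector):
# lsh_model.approxNearestNeighbors(transformed, centroid_vector, 20)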

I found this link helpful for understanding LSH.

All that said, you could partition the data by KMeans cluster and then run silhouette on a sample of each partition (via mapPartitions), then apply the sampled score to the entire group. Here's a good explanation of how the samples are taken, so you don't have to start from scratch. I would expect really dense clusters to get under-scored by sampled silhouettes, so this may not be a perfect way of going about things, but it would still be informative.
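
A minimal sketch of the sampling idea, assuming a fitted pipeline model (pipeline_model), the training DataFrame df, and an active spark session; for simplicity it pulls the sample to the driver and scores it with sklearn's silhouette_samples rather than using mapPartitions:

import numpy as np
from sklearn.metrics import silhouette_samples

predictions = pipeline_model.transform(df)

# Sample a manageable fraction of the data and bring it to the driver.
sample = (predictions.select("pca_output", "prediction")
                     .sample(fraction=0.1, seed=2014)
                     .toPandas())

X = np.vstack(sample["pca_output"].map(lambda v: v.toArray()).tolist())
labels = sample["prediction"].to_numpy()
sample["silhouette"] = silhouette_samples(X, labels)

# Average the sampled per-point scores per cluster and join the result
# back onto the full dataset as an approximate per-cluster silhouette.
per_cluster = spark.createDataFrame(
    sample.groupby("prediction")["silhouette"].mean().reset_index())
scored = predictions.join(per_cluster, on="prediction", how="left")

Since the score is computed only on the sample, treat it as an approximation and increase the fraction if some clusters are small.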

Matt Andruff