Simple KMeans model
Data
library(tidyverse)
library(broom)
library(cluster)
set.seed(42)
data = tibble(x = c(rnorm(20, 5, 1), rnorm(20, 10, 3)),
y = c(rnorm(20, 5, 1), rnorm(20, 10, 3)))
Model
kmeans_k2 <- kmeans(data, 2, 9)
Tidymodel Results
tidy(kmeans_k2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 9.60 10.8 18 255. 1
2 5.22 5.20 22 76.6 2
glance(kmeans_k2)
# A tibble: 1 x 4
totss tot.withinss betweenss iter
<dbl> <dbl> <dbl> <int>
1 836. 331. 505. 1
augment(kmeans_k2, data)
# A tibble: 40 x 3
x y .cluster
<dbl> <dbl> <fct>
1 6.37 5.21 2
2 4.44 4.64 2
3 5.36 5.76 2
4 5.63 4.27 2
5 5.40 3.63 2
6 4.89 5.43 2
7 6.51 4.19 2
8 4.91 6.44 2
9 7.02 4.57 2
10 4.94 5.66 2
# ... with 30 more rows
Silhouette Calculation
sil_k2 <- silhouette(kmeans_k2$cluster, dist(data))
summary(sil_k2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = kmeans_k2$cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
summary(sil_k2)$si.summary[4]
Mean
0.5466503
Nested tibble using Purrr's map function to create results for multiple different k's
kmeans_k123 <- tibble(k = 1:3) %>%
mutate(km_model = map(k, ~kmeans(data, .x)),
tidydata = map(km_model, tidy),
glancedata = map(km_model, glance),
augmentdata = map(km_model, augment, data))
kmeans_k123
# A tibble: 3 x 5
k km_model tidydata glancedata augmentdata
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
2 2 <kmeans> <tibble [2 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
3 3 <kmeans> <tibble [3 x 5]> <tibble [1 x 4]> <tibble [40 x 3]>
pluck(kmeans_k23, 3, 2)
# A tibble: 2 x 5
x y size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 5.63 5.10 26 117. 1
2 11.3 10.7 14 176. 2
Question is, How do I add the Silhouette score to the nested tibble? The Silhouette function needs the clusters of each model and I'm not sure how to do that. Obviously I can pluck out a single instance, such as
data_k2cluster <- pluck(kmeans_k123, 2, 2)$cluster
data_k2cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
sil_k2v2 <- silhouette(data_k2cluster, dist(data))
summary(sil_k2v2)
Silhouette of 40 units in 2 clusters from silhouette.default(x = data_k2cluster, dist = dist(data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But when I try and use that with map it doesn't work
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, silhouette))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, silhouette)`.
i Input `sildata` must be size 3 or 1, not 0.
I can create a function, which again works for a single occurence
my_fn <- function(f_cluster, f_data){my_fn <- silhouette(f_cluster, dist(f_data))}
summary(my_fn(kmeans_k2$cluster, data))
Silhouette of 40 units in 2 clusters from silhouette.default(x = f_cluster, dist = dist(f_data)) :
Cluster sizes and average silhouette widths:
18 22
0.3527383 0.7053056
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.04243 0.45422 0.58233 0.54665 0.74472 0.79591
But fails when I use it with map.
kmeans_k123 %>% mutate(sildata = map2(km_model$cluster, data, my_fn))
Error: Problem with `mutate()` input `sildata`.
x Input `sildata` can't be recycled to size 3.
i Input `sildata` is `map2(km_model$cluster, data, my_fn)`.
i Input `sildata` must be size 3 or 1, not 0.
I suspect that the problem is related to how I am trying to retrieve $cluster from the nested models as I've tried extracting that to create it's own column, but can't get that to work either.