Using dplyr and broom to compute kmeans on a training and test set

Question

I am using dplyr and broom to compute kmeans for my data. My data contains a test and training set of X and Y coordinates and are grouped by a some parameter value (lambda in this case):

mds.test = data.frame()
for(l in seq(0.1, 0.9, by=0.2)) {
  new.dist <- run.distance.model(x, y, lambda=l)
  mds <- preform.mds(new.dist, ndim=2)
  mds.test <- rbind(mds.test, cbind(mds$space, design[,c(1,3,4,5)], lambda=rep(l, nrow(mds$space)), data="test"))
}

> head(mds.test)
                        Comp1       Comp2 Transcripts Genes Timepoint Run lambda data
7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106       68125  9324     Day 0  7A    0.1 test
7A_0_AAATGACTGGCC -0.15292848  0.04310200       28443  6746     Day 0  7A    0.1 test
7A_0_CATCTCGTTCTA -0.12529445  0.13022908       27360  6318     Day 0  7A    0.1 test
7A_0_ACCGGCACATTC -0.33015913  0.14647857       23038  5709     Day 0  7A    0.1 test
7A_0_TATGTCGGAATG -0.25826098  0.05424976       22414  5878     Day 0  7A    0.1 test
7A_0_GAAAAAGGTGAT -0.24349387  0.08071162       21907  6766     Day 0  7A    0.1 test

I've head the test dataset above but I also have one named mds.train which contains my training data coordinates. My ultimate goal here is to run k-means for both sets grouped by lambda, then compute the within.ss, between.ss and total.ss for the test data on the training centers. Thanks to a great resource on broom, I am able to run kmeans for each lambda for the test set by simply doing the following:

test.kclusts  = mds.test %>% 
  group_by(lambda) %>% 
  do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))

Then I can compute the centers of this data for each cluster within each lambda:

test.clusters = test.kclusts %>% 
  group_by(lambda) %>%  
  do(tidy(.$kclust[[1]]))

This is where I am stuck. How do I compute the feature assignments as similarly shown on the reference page (e.g. kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix))), where my points.matrix is mds.test which is a data.frame with length(unique(mds.test$lambda)) times as many rows as should be? And is there a way to somehow use centers from the training set to compute glance() statistics based off the test assignments?

Any help would be greatly appreciated! Thank you!

EDIT: Updating progress. I have figured out how to aggregate the test/training assignments but am still having issues trying to compute kmeans statistics from both sets (training assignments on test center and test assignments on training centers). Updated code is below:

test.kclusts  = mds.test %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
test.clusters = test.kclusts %>% group_by(lambda) %>%  do(tidy(.$kclust[[1]])) 
test.clusterings = test.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
test.assignments = left_join(test.kclusts, mds.test) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

train.kclusts  = mds.train %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
train.clusters = train.kclusts %>% group_by(lambda) %>%  do(tidy(.$kclust[[1]])) 
train.clusterings = train.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
train.assignments = left_join(train.kclusts, mds.train) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

test.assignments$data = "test"
train.assignments$data = "train"
merge.assignments = rbind(test.assignments, train.assignments)
merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?

Ive attached a plot below which illustrates my progress to this point. Just to reiterate, I would like to compute kmeans statistics (within sum of square, total sum of squares, and between sum of squares) for the training data centers on test assignments/coordinates (the plots which the centers look off):

Looks like there's been some talk of making a tidymodels recipe or model specification, e.g. recipes::step_kmeans() (see conversation here and links from there: https://github.com/tidymodels/embed/issues/77#issuecomment-801300749 . But has not been done yet. There is also widyr::widely_kmeans() but this also doesn't have an easy to apply to a test set. — Bryan Shalloway, Aug 19 '21 at 07:58

Bryan Shalloway · Answer 1 · 2021-09-09T19:32:11.693

One approach would be to...

extract the table specifying the centroids of your clusters (built on the training set) via broom.
calculate the distance of each point in the test set from each of the cluster centroids built using the training set. Could do this via fuzzyjoin package.
the cluster centroid that a test point has the shortest Euclidian distance from represents its assigned cluster.
From there you can calculate any metrics of interest.

See below using a simpler dataset pulled from clustering example from tidymodels.

library(tidyverse)
library(rsample)
library(broom)
library(fuzzyjoin)

# data and train / test set-up
set.seed(27)
centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(100, 150, 50),  # number points in each cluster
  x1 = c(5, 0, -3),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points) %>% 
  unnest(cols = c(x1, x2))

points <- 
  labelled_points %>% 
  select(-cluster)

set.seed(1234)

split <- rsample::initial_split(points)
train <- rsample::training(split)
test <- rsample::testing(split)

# Fit kmeans on train then assign clusters to test
kclust <- kmeans(train, centers = 3)

clust_centers <- kclust %>% 
  tidy() %>% 
  select(-c(size, withinss))

test_clusts <- fuzzyjoin::distance_join(mutate(test, index = row_number()), 
                         clust_centers,
                         max_dist = Inf,
                         method = "euclidean",
                         distance_col = "dist") %>% 
  group_by(index) %>% 
  filter(dist == min(dist)) %>% 
  ungroup()
#> Joining by: c("x1", "x2")

# resulting table
test_clusts
#> # A tibble: 75 x 7
#>     x1.x    x2.x index  x1.y  x2.y cluster  dist
#>    <dbl>   <dbl> <int> <dbl> <dbl> <fct>   <dbl>
#>  1  4.24 -0.946      1  5.07 -1.10 3       0.847
#>  2  3.54  0.287      2  5.07 -1.10 3       2.06 
#>  3  3.71 -1.67       3  5.07 -1.10 3       1.47 
#>  4  5.03 -0.788      4  5.07 -1.10 3       0.317
#>  5  6.57 -2.49       5  5.07 -1.10 3       2.04 
#>  6  4.97  0.233      6  5.07 -1.10 3       1.34 
#>  7  4.43 -1.89       7  5.07 -1.10 3       1.01 
#>  8  5.34 -0.0705     8  5.07 -1.10 3       1.07 
#>  9  4.60  0.196      9  5.07 -1.10 3       1.38 
#> 10  5.68 -1.55      10  5.07 -1.10 3       0.758
#> # ... with 65 more rows

# calc within clusts SS on test
test_clusts %>% 
  group_by(cluster) %>% 
  summarise(size = n(),
            withinss = sum(dist^2),
            withinss_avg = withinss / size)
#> # A tibble: 3 x 4
#>   cluster  size withinss withinss_avg
#>   <fct>   <int>    <dbl>        <dbl>
#> 1 1          11     32.7         2.97
#> 2 2          35     78.9         2.26
#> 3 3          29     62.0         2.14

# compare to on train
tidy(kclust) %>% 
  mutate(withinss_avg = withinss / size)
#> # A tibble: 3 x 6
#>        x1    x2  size withinss cluster withinss_avg
#>     <dbl> <dbl> <int>    <dbl> <fct>          <dbl>
#> 1 -3.22   -1.91    40     76.8 1               1.92
#> 2  0.0993  1.06   113    220.  2               1.95
#> 3  5.07   -1.10    72    182.  3               2.53

# plot of test and train points
test_clusts %>% 
  select(x1 = x1.x, x2 = x2.x, cluster) %>% 
  mutate(type = "test") %>% 
  bind_rows(
    augment(kclust, train) %>% 
      mutate(type = "train") %>% 
      rename(cluster = .cluster)
    ) %>% 
  ggplot(aes(x = x1, 
             y = x2, 
             color = as.factor(cluster)))+
  geom_point()+
  facet_wrap(~fct_rev(as.factor(type)))+
  coord_fixed()+
  labs(title = "Cluster Assignment on Training and Holdout Datasets",
       color = "Cluster")+
  theme_bw()

^{Created on 2021-08-19 by the reprex package (v2.0.0)}

(See comment on OP for link to conversations on making this easier within tidymodels.)

Using dplyr and broom to compute kmeans on a training and test set

1 Answers1