I have this vector:

   x = c(0.5,0.1,0.3,6,5,2,1,4,2,1,0.9,3,6,99,22,11,44,55)

I apply `kmeans` to it:

   kmeans(x, centers=5, iter.max = 10, nstart = 1)

This gives:

   Cluster means:
      [,1]
   1 99.00
   2 49.50
   3 22.00
   4  7.00
   5  1.48

   Clustering vector:
    [1] 5 5 5 4 4 5 5 5 5 5 5 5 4 1 3 4 2 2

Now I want to assign the values in `x2` to the clusters found for `x` (1, 2, 3, 4, 5). How can I do this?

   x2 = c(0.3,1,3,0.66,0.5,0.2,0.1,64,92,21,0.93,93,6,99,22,11,44,55)
neilfws
Tpellirn
  • k-means isn't really designed for prediction on new data. [See this answer](https://stackoverflow.com/a/49017208/89482). – neilfws May 24 '22 at 06:27

2 Answers


Here is a naive approach based on the distance from each new value to the centroid of each cluster:

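# Fit k-means on the original data
km <- kmeans(x, centers = 5, iter.max = 10, nstart = 1)
# For each new value, pick the index of the closest centroid
group <- sapply(x2, function(xx) which.min(abs(km$centers - xx)))

# Original points, coloured by fitted cluster
plot(x = x, y = rep(1, length(x)), col = km$cluster)
# Centroids
points(x = km$centers, y = rep(1, length(km$centers)), col = "purple", pch = "*")
# New points, coloured by assigned cluster
points(x = x2, y = rep(1, length(x2)), col = group, pch = "+")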

Please check the link provided by @neilfws about doing predictions with kmeans.
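
The nearest-centroid rule above can be wrapped in a small helper that also accepts data with more than one column. This is only a sketch: `assign_to_centers` is a hypothetical name (not a function from base R or `stats`), it assumes Euclidean distance, and `set.seed(1)` is added purely so that repeated runs give the same labels.

```r
# Hypothetical helper: assign each new observation to its nearest centroid
# (Euclidean distance). Not part of base R or the stats package.
assign_to_centers <- function(newdata, centers) {
  newdata <- as.matrix(newdata)   # a plain vector becomes a one-column matrix
  centers <- as.matrix(centers)
  stopifnot(ncol(newdata) == ncol(centers))
  apply(newdata, 1, function(row) {
    # squared distance from this observation to every centroid
    d2 <- rowSums(sweep(centers, 2, row)^2)
    unname(which.min(d2))
  })
}

x  <- c(0.5, 0.1, 0.3, 6, 5, 2, 1, 4, 2, 1, 0.9, 3, 6, 99, 22, 11, 44, 55)
x2 <- c(0.3, 1, 3, 0.66, 0.5, 0.2, 0.1, 64, 92, 21, 0.93, 93, 6, 99, 22, 11, 44, 55)

set.seed(1)  # assumption: fixed seed so the cluster labels are reproducible
km <- kmeans(x, centers = 5, iter.max = 10, nstart = 1)

# One cluster index (1..5) per element of x2
group2 <- assign_to_centers(x2, km$centers)
```

For one-dimensional data this should give the same result as the `sapply` line above; the helper only becomes useful once the data have several columns.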

Clemsang
  • Thanks for this answer. Just one question. When I run `km$centers` I get 1: 1.48, 2: 49.50, 3: 7.00, 4: 99.00, 5: 22.00, so cluster 5 is 22. But when I look at `km$cluster`, cluster 5 is 1.48. I am confused. – Tpellirn May 24 '22 at 09:03
  • See `?kmeans`: 'A vector of integers (from 1:k) indicating the cluster to which each point is allocated', here from your object `x` each integer is the corresponding cluster – Clemsang May 24 '22 at 09:15

There is a predict method for kmeans clusters in the fdm2id package. Load the package and type ?KMEANS and ?predict.kmeans for more information.

library(data.table)
library(fdm2id)
# Training data (x) and the new values to classify (x2)
x  <- c(0.5,0.1,0.3,6,5,2,1,4,2,1,0.9,3,6,99,22,11,44,55)
dt <- data.table(x2 = c(0.3,1,3,0.66,0.5,0.2,0.1,64,92,21,0.93,93,6,99,22,11,44,55))
dt[, pred:=predict(KMEANS(x, k = 5), newdata=x2)]

Note that the cluster numbers are arbitrary in the sense that they merely distinguish between clusters, so the numbering might not be the same but the cluster membership should align.
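
Because the numbering is arbitrary, it can help to relabel the clusters in a fixed order before comparing runs. The following is a minimal sketch for one-dimensional data only, using base R; the variable names (`ord`, `relabel`, `sorted_cluster`) and the seed are illustrative assumptions, not part of either package.

```r
x <- c(0.5, 0.1, 0.3, 6, 5, 2, 1, 4, 2, 1, 0.9, 3, 6, 99, 22, 11, 44, 55)

set.seed(42)  # assumption: any seed works, it only fixes this particular run
km <- kmeans(x, centers = 5, iter.max = 10, nstart = 1)

# ord[i] is the original label of the cluster with the i-th smallest centre
ord <- order(km$centers[, 1])
sorted_centers <- km$centers[ord, 1]

# Map each original label to its rank, so label 1 is always the smallest centre
relabel <- match(seq_along(ord), ord)
sorted_cluster <- relabel[km$cluster]
```

After this step, `sorted_cluster` and `sorted_centers` use the same numbering regardless of how `kmeans` happened to label the clusters in a given run.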

jlhoward