2

It seems that the 'SwarmSVM' package used to have a kmeans.predict function, but no longer does.

I would like to divide a dataframe to training+testing subsets to train a model and then test it. I am currently only able to use the 'kmeans' function to create clusters, but I can't figure out which functions/packages to use to train and test a model.

Maria Gold
  • Here are a few ways to split your data into training and testing sets: https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function The `caTools` library might also be useful. – antonioACR1 Feb 27 '18 at 19:37

2 Answers

8

k-means is a clustering method, i.e. unsupervised rather than supervised learning, and as such isn't designed to predict on future data: refitting with additional data would change the centers. Supervised alternatives that can do classification include k-NN, LDA/QDA, and SVMs, but such approaches require a training set with known classes.

All that said, you could write a predict method for stats::kmeans using dist, as you're presumably really looking for the closest center to the point. Hardly optimized, but functional:

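# S3 predict method for stats::kmeans: assign each row of newdata to its nearest center
predict.kmeans <- function(object, newdata){
    centers <- object$centers
    n_centers <- nrow(centers)
    # pairwise distances among the centers and the new points together
    dist_mat <- as.matrix(dist(rbind(centers, newdata)))
    # keep rows for new points and columns for centers; drop = FALSE handles a single new row
    dist_mat <- dist_mat[-seq(n_centers), seq(n_centers), drop = FALSE]
    # index of the smallest distance in each row
    max.col(-dist_mat)
}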

set.seed(47)
in_train <- sample(nrow(iris), 100)
mod_kmeans <- kmeans(iris[in_train, -5], 3)
test_preds <- predict(mod_kmeans, iris[-in_train, -5])

table(test_preds, iris$Species[-in_train])
#>           
#> test_preds setosa versicolor virginica
#>          1      0          0        10
#>          2      0         18         7
#>          3     15          0         0
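
Since the method above builds a full distance matrix over centers and new points together, a lighter variant may be worth sketching. The helper below, `nearest_center`, is a hypothetical name (it is not part of stats or any package); it computes squared Euclidean distances from each new row to each center directly, and should give the same assignments under the assumption that `newdata` has the same numeric columns, in the same order, as the data used to fit the model.

```r
# Hypothetical helper (not part of any package): nearest-center assignment
# without building the full pairwise distance matrix.
nearest_center <- function(centers, newdata) {
  x <- as.matrix(newdata)
  # One column of squared Euclidean distances per center
  sq_dist <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  # sapply returns a plain vector for a single new row; restore the matrix shape
  sq_dist <- matrix(sq_dist, nrow = nrow(x))
  # Index of the closest center for each row
  max.col(-sq_dist, ties.method = "first")
}

set.seed(47)
in_train <- sample(nrow(iris), 100)
mod_kmeans <- kmeans(iris[in_train, -5], 3)
test_preds <- nearest_center(mod_kmeans$centers, iris[-in_train, -5])
```

The result can be tabulated against `iris$Species[-in_train]` in the same way as before. Cluster numbers are arbitrary labels, so they only line up with species by inspection.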
alistaire
  • Thanks! I will try to figure this out, this seems to potentially be the thing I was looking for. – Maria Gold Mar 01 '18 at 17:08
  • @alistaire, suppose I use the k-means cluster as a predictor when training on my data. Shouldn't I use your function, e.g., to make predictions on new data? – xm1 Nov 18 '20 at 14:52
  • You can use it to score, certainly, but you need some certainty that the training data comes from the same distribution as what you're scoring on, else the centers are not really representative of modes. – alistaire Nov 20 '20 at 03:03
0
install.packages("class")
library(class)

Use the knn function.

For further help, see

 ?knn
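
As a rough illustration of how `knn` might be applied here, the sketch below reuses the iris data with a 100-row training sample. The split, the seed, and `k = 5` are arbitrary assumptions chosen for the example, not recommendations. Note that `class` is a recommended package bundled with most R installations, so the `install.packages` step is often unnecessary.

```r
library(class)

# Illustrative split: 100 rows for training, the remaining 50 for testing
set.seed(47)
in_train <- sample(nrow(iris), 100)

# knn classifies each test row by majority vote among its k nearest training rows
knn_preds <- knn(
  train = iris[in_train, -5],       # predictors for the training rows
  test  = iris[-in_train, -5],      # predictors for the rows to classify
  cl    = iris$Species[in_train],   # known classes of the training rows
  k     = 5                         # number of neighbours (arbitrary choice)
)

# Confusion table of predicted vs. actual species on the test rows
table(knn_preds, iris$Species[-in_train])
```

Unlike k-means, this requires the true classes of the training rows (`cl`), which is what makes it a supervised method.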
Michael Cantrall
  • The question is how to predict using clustering. That is a very interesting issue. Of course, we know that a supervised method such as KNN can give us an accurate prediction, but the question is how to predict using an unsupervised method like this one. – pabloverd Feb 11 '20 at 20:50