I have performed k-modes clustering on the categorical variables of a historical data set, because I wanted to see which clusters the data falls into. Now that I have the output, is there any way to predict which cluster a new observation will fall into when it comes in?

One way might be to use the existing rows, together with the cluster each one was assigned to, as training data for supervised learning. But I want to know whether there is a method that uses the existing clustering output directly to predict the cluster of new data (a sort of semi-supervised learning).

I may not be able to share any data or output since I am working for a client, but any direction on how to approach this would be very helpful. I have been researching this for quite some time now but couldn't find a suitable solution.

Shades
Radhakrishnan
  • If you are unable to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), then we are unable to help. It doesn't have to be your actual data, but you should be able to create an example with simulated data or one of the built in data sets in R. – MrFlick Oct 16 '17 at 14:14
  • Train a classifier of your choice on the clustered data. Then use the classifier to predict on the new data. – G5W Oct 16 '17 at 14:25
  • Alternatively, cluster "training" and "test" data together. It's more computationally expensive, since you would have to rerun your algorithm each time new data becomes available. – Artem Sokolov Oct 16 '17 at 15:44

1 Answer

Most clustering algorithms cannot predict for new data.

k-means and GMM are exceptions, and k-modes should work like k-means: assign each new point to the cluster whose mode is most similar to it.
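As a rough illustration of "find the most similar mode", here is a minimal sketch in plain Python. It assumes the cluster modes have already been extracted from the fitted k-modes result and that similarity is measured by simple matching (the number of attributes that differ). The `modes` values and the function names are made up for the example and do not come from any particular library.

```python
def matching_dissimilarity(row, mode):
    """Count the attributes on which a row and a cluster mode disagree."""
    return sum(a != b for a, b in zip(row, mode))


def predict_cluster(row, modes):
    """Return the index of the mode with the fewest mismatches.

    Ties go to the lowest cluster index, because min() keeps the first minimum.
    """
    distances = [matching_dissimilarity(row, mode) for mode in modes]
    return min(range(len(modes)), key=distances.__getitem__)


# Hypothetical modes, one tuple of category levels per cluster.
modes = [
    ("red", "small", "round"),    # cluster 0
    ("blue", "large", "square"),  # cluster 1
]

# New observations that were not part of the original clustering.
new_rows = [
    ("red", "large", "round"),
    ("blue", "large", "round"),
]

labels = [predict_cluster(row, modes) for row in new_rows]
print(labels)
```

The first new row differs from mode 0 in one attribute and from mode 1 in two, so it goes to cluster 0; the second row is closer to mode 1.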

But when you use clustering, you really should analyze the clusters and double-check them, because a clustering is rarely 100% right. Usually you'll want to keep some clusters from run A, some from run B, and so on, whatever makes sense. Then train a classifier on the reviewed, cleaned-up clusters and use it for prediction.
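The answer does not name a classifier, so the following is only a sketch under assumptions: a small categorical naive Bayes with add-one smoothing, written in plain Python, trained on rows whose cluster labels have already been reviewed. The data, the labels "A" and "B", and the function names are invented for illustration; any classifier that handles categorical inputs could take its place.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Collect the counts needed for a categorical naive Bayes model."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, column) -> Counter of values
    levels = defaultdict(set)            # column -> set of observed values
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            value_counts[(label, col)][value] += 1
            levels[col].add(value)
    return class_counts, value_counts, levels


def predict_naive_bayes(model, row):
    """Return the label with the highest smoothed log-posterior for a row."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the cluster
        for col, value in enumerate(row):
            count = value_counts[(label, col)][value]
            # Add-one smoothing, with one extra slot reserved for unseen levels.
            score += math.log((count + 1) / (n + len(levels[col]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical historical rows with their reviewed cluster labels.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_naive_bayes(rows, labels)
prediction_1 = predict_naive_bayes(model, ("red", "small", "square"))
prediction_2 = predict_naive_bayes(model, ("blue", "large", "square"))
print(prediction_1, prediction_2)
```

Unlike nearest-mode assignment, the classifier learns from whatever labels it is given, so clusters that were merged, split, or relabelled during review are carried over into the predictions.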

Has QUIT--Anony-Mousse