How to do K-means clustering on a dataset full of string variables in R

Question

Right now I have a dataset that is full of string variables, and I want to do a clustering project on it. After I apply as.factor() to all the variables, NbClust() still does not work. What am I supposed to do?

score 0 · Accepted Answer · answered May 31 '18 at 00:58

K-means typically uses Euclidean distances (see e.g. https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric) so you can't directly "cluster on words".

If you want to cluster observations based on words, you have to generate numbers from them first (see e.g. "k-means for text clustering"). For example, if you were trying to cluster customer profiles to do segmentation, you could create one column per interest, count the number of times that word or n-gram appears in each profile, and then cluster on that matrix of numbers. Or, when clustering documents, generate a term-document matrix (or a document-term matrix, or a term-term co-occurrence matrix, as in "k-means clustering on term-term co-occurrence matrix") and use those numbers for clustering.
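As a rough sketch of the counting approach described above, the following uses only base R. The `profiles` vector is made-up data, and names such as `tokens`, `vocab` and `dtm` are just illustrative choices; a real project would probably use a text-mining package and more careful tokenization.

```r
# Hypothetical customer profiles (made-up data for illustration only)
profiles <- c(
  "hiking camping hiking fishing",
  "camping fishing hiking",
  "gaming movies gaming music",
  "music movies gaming"
)

# Split each profile into lowercase words and drop empty tokens
tokens <- lapply(strsplit(tolower(profiles), "[^a-z]+"),
                 function(words) words[nzchar(words)])

# Vocabulary: one column per distinct word
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each profile;
# vapply returns words in rows, so transpose to get one row per profile
dtm <- t(vapply(
  tokens,
  function(words) as.numeric(tabulate(match(words, vocab), nbins = length(vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# Cluster the numeric count matrix; nstart reruns k-means from several random starts
set.seed(42)
fit <- kmeans(dtm, centers = 2, nstart = 10)

print(dtm)
print(fit$cluster)
```

The cluster labels themselves (1 or 2) are arbitrary; what matters is which rows end up sharing a label. Raw counts are only one possible encoding, and whether Euclidean distance on them is sensible depends on the data.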

score 0 · Answer 2 · answered Jun 02 '18 at 08:52

Don't use k-means on such data.

You can't get meaningful statistical analysis just by "trial and error", because there are many ways to get a result that looks okayish but is totally unfounded.

Before you use any of these approaches, you need to understand what they do. K-means minimizes the within-cluster sum of squared deviations from the cluster means, which obviously only makes sense for continuous variables. The variables also need to behave linearly. If you have multiple variables, they also need to be on the same scale.
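To illustrate the point about magnitudes, here is a small sketch on simulated data. Everything in it is an assumption made for the example: the `income` and `visits` columns, the group structure, and the helper `agreement()` are invented, and `scale()` is just one common way to put variables on a comparable footing.

```r
# Made-up data: two variables on very different scales
set.seed(1)
df <- data.frame(
  income = c(rnorm(20, 30000, 5000), rnorm(20, 32000, 5000)),  # large numbers, weak group difference
  visits = c(rnorm(20, 2, 0.5), rnorm(20, 8, 0.5))             # small numbers, clear group difference
)
true_group <- rep(1:2, each = 20)

# k-means on the raw data: squared distances are dominated by income
raw_fit <- kmeans(df, centers = 2, nstart = 10)

# k-means after centering each column and scaling it to unit variance
scaled_fit <- kmeans(scale(df), centers = 2, nstart = 10)

# Share of points whose cluster matches the simulated group,
# allowing for the arbitrary numbering of the two clusters
agreement <- function(truth, cluster) {
  tab <- table(truth, cluster)
  max(sum(diag(tab)), sum(diag(tab[, 2:1]))) / length(truth)
}

print(agreement(true_group, raw_fit$cluster))
print(agreement(true_group, scaled_fit$cluster))
```

On the raw data the split follows `income` almost entirely, simply because its squared differences are millions of times larger than those of `visits`. Scaling fixes the magnitude problem only; it does not make categorical or string data suitable for k-means.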

It's not a black box method. If you use it badly, you just get garbage out.
