6

There seems to be a lot of information about creating either hierarchical or k-means clusters, but I would like to know if there is a solution in R that would create K clusters of approximately equal size. There is some material out there about doing this in other languages, but I have not been able to find anything online that suggests how to achieve this in R.

An example would be

set.seed(123)
df <- matrix(rnorm(100*5), nrow=100)
km <- kmeans(df, 10)
print(sapply(1:10, function(n) sum(km$cluster==n)))

which results in

[1] 14 12  4 13 16  6  8  7 13  7

I would ideally like to see

[1] 10 10 10 10 10 10 10 10 10 10 
Graeme
  • All I can do is refer you to http://cran.r-project.org/web/views/Cluster.html for a comprehensive list of cluster-related packages. I hope someone familiar with that problem will have a particular suggestion for you though. – flodel Jan 06 '15 at 18:55
  • I believe Ward clustering will produce clusters of more equal size. Try playing with `hclust(d, method="ward.D")` or `hclust(d, method="ward.D2")` – JasonAizkalns Jan 06 '15 at 19:09
  • Thank you flodel and jaysunice. Jaysunice, I will look into that tomorrow. – Graeme Jan 06 '15 at 22:57
  • I've edited my question, but I'm not quite sure why I am being put on hold, except that I understand I implicitly broke the "asking for tool or software library" rule above. However, if this is the case, half of all R questions are doing the same thing. How is asking for a specific type of clustering that may or may not be in some library different from asking for a function to convert dates, or to draw a specific type of graph (which will require a library), or to combine two tables (requires a library), except that my question requires some level of knowledge that is difficult to search for? – Graeme Jan 06 '15 at 23:42

2 Answers

-1

I would argue that you shouldn't, in the first place. Why? When there are naturally well-formed clusters in your data, e.g.,

plot(matrix(c(sample(1:10, 10), sample(30:40, 7), sample(80:90, 9)), ncol = 2, byrow = FALSE))

then these will be clustered together anyway (assuming k equals the natural number of clusters; see this comprehensive answer on how to choose a good k). If the natural clusters are uniform in size, then you will get clusters of ~equal size; if they are not, then forcing a uniform cluster size will surely degrade the fit of the clustering solution. If you do not have naturally pretty clusters in your data, e.g.,

plot(matrix(sample(1:100, 100), ncol = 2))

then forcing a cluster size will either be redundant (if the data is completely random, the cluster sizes will be ~equal - but then there is not much point in clustering anyhow), or, if there are some nice clusters in there, e.g.,

plot(matrix(c(sample(1:15, 15), sample(20:100, 11)), ncol = 2, byrow = TRUE))

then the forced size will almost certainly break them.
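To make "forcing a cluster size" concrete, here is a minimal sketch of one possible approach, not a method from any particular package: run kmeans as in the question, then greedily reassign points to the nearest center that still has room. The names `cap`, `d2`, `fill` and `balanced` are made up for this illustration, and the greedy pass is an assumption of mine rather than an optimal assignment.

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)
k <- 10
cap <- ceiling(nrow(df) / k)   # hypothetical capacity: max points per cluster

km <- kmeans(df, k)

# Squared Euclidean distance from every point (rows) to every center (columns)
d2 <- sapply(seq_len(k), function(j) rowSums(sweep(df, 2, km$centers[j, ])^2))

# Visit (point, center) pairs from closest to farthest; assign a point to a
# center only if the point is still unassigned and the center has room.
ord <- order(d2)                     # column-major positions in d2
pt  <- (ord - 1) %% nrow(df) + 1     # row (point) index of each pair
ctr <- (ord - 1) %/% nrow(df) + 1    # column (center) index of each pair

balanced <- integer(nrow(df))        # 0 means "not yet assigned"
fill <- integer(k)                   # current size of each cluster
for (i in seq_along(ord)) {
  p <- pt[i]
  j <- ctr[i]
  if (balanced[p] == 0L && fill[j] < cap) {
    balanced[p] <- j
    fill[j] <- fill[j] + 1L
  }
}

print(table(balanced))
```

Because the total capacity equals the number of points here, every cluster ends up with exactly `cap` members. Comparing `balanced` with `km$cluster` shows how many points were moved away from their nearest center, which is the cost the argument above is about.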

Ward's method, mentioned in the comments by JasonAizkalns, will however give you more "round" clusters than, for example, single-link, so that might be a way to go (see help(hclust) for the difference between ward.D and ward.D2; it is not arbitrary).
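As a rough sketch of trying the Ward suggestion on the data from the question: the code below builds the tree with `hclust`, cuts it into 10 groups with `cutree`, and tabulates the sizes. The choice of `ward.D2` and of Euclidean `dist` is an assumption for illustration; nothing here guarantees equal sizes, it only lets you inspect how balanced the result is.

```r
set.seed(123)
df <- matrix(rnorm(100 * 5), nrow = 100)

d  <- dist(df)                        # Euclidean distances between rows
hc <- hclust(d, method = "ward.D2")   # Ward's criterion on unsquared distances
cl <- cutree(hc, k = 10)              # cut the tree into 10 clusters

print(table(cl))                      # cluster sizes, for comparison with kmeans
```

Swapping `method` for `"single"` in the same code is a quick way to see the contrast in cluster sizes mentioned above.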

user3554004
  • 2
    (-1) Doesn't answer the question. Say you have data from n mixed signals, equally sampled m times. One might want to cluster the data into n equally sized clusters. – catastrophic-failure Sep 12 '16 at 14:13
-3

It's not totally clear what you're asking, but it is very easy to generate random data in R. If your data set has two dimensions, you could do something like this:

cluster1 = data.frame(x = rnorm(100, mean=5,sd=1), y  = rnorm(100, mean=5,sd=1))
cluster2 = data.frame(x = rnorm(100, mean=15,sd=1), y  = rnorm(100, mean=15,sd=1))

This generates normally distributed random data across x and y for 100 data points in each cluster.

Then view it -

plot(cluster1, xlim = c(0,25), ylim = c(0,25))
lines(cluster2, type = "p")
DG1
  • 2
    I don't think you know what clustering analysis is. Imagine data with 200 points, the OP wants a process that will label the points into two clusters of 100 each. – flodel Jan 06 '15 at 18:53
  • I do... I thought he wanted to generate data to then do clustering analysis on for whatever reason. – DG1 Jan 06 '15 at 18:58