Cluster analysis in R: How can I get deterministic results from pvclust?

Question

pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?

I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.

score 6 · Accepted Answer · answered Jan 02 '14 at 05:53

This applies not only to cluster analysis: whenever randomness is involved, you can seed the random number generator so that you always get the same results.

Try:

set.seed(seed=123)
# your code here

The seed can be any integer, or anything that can be converted to an integer. That's all there is to it.
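Applied to pvclust, that might look like the sketch below. It assumes the pvclust package is installed, and the dataset (the built-in mtcars), the seed value and the small nboot are illustrative choices only, not recommendations:

```r
# Assumes pvclust is installed: install.packages("pvclust")
library(pvclust)

# Illustrative data: pvclust clusters the columns of the input.
dat <- scale(mtcars)

# First run, with the random number generator seeded beforehand.
set.seed(123)
fit1 <- pvclust(dat, method.hclust = "average",
                method.dist = "correlation", nboot = 100)

# Second run: reset the same seed immediately before the call.
set.seed(123)
fit2 <- pvclust(dat, method.hclust = "average",
                method.dist = "correlation", nboot = 100)

# Compare the bootstrap-based p-values (AU and BP) from both runs.
identical(fit1$edges$au, fit2$edges$au)
identical(fit1$edges$bp, fit2$edges$bp)
```

The seed has to be set right before each pvclust call; any other code that draws random numbers in between would shift the generator state. Recent pvclust versions also appear to document an iseed argument for this purpose, so ?pvclust is worth checking, particularly when running with parallel = TRUE.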

score 2 · Answer 2 · answered Jan 02 '14 at 06:08

I've only used k-means. There, I had to set the number of 'runs' or iterations higher than the default to get the same clusters on consecutive runs.

lebatsnok


Good observation – with enough iterations, pvclust will exhaust the search space and always converge on a "final" solution. But with a large data set, that could take a very long time, and finding the right number of iterations is a matter of trial and error. Setting the random seed produces the same result regardless of the number of iterations chosen. – Fabian Fagerholm Jan 02 '14 at 07:11
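For the k-means case mentioned in this answer, the two approaches can be combined. The following is a minimal sketch using base R's kmeans; the simulated data, the seed values and nstart = 25 are assumptions made purely for illustration:

```r
# Illustrative data only: three well-separated groups in two dimensions.
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# More random starts (nstart) make k-means more likely to reach the
# same partition on every run, at the cost of extra computation.
set.seed(42)
km_a <- kmeans(x, centers = 3, nstart = 25)

# Re-seeding before the repeat run makes it reproduce the first run,
# including the arbitrary numbering of the clusters.
set.seed(42)
km_b <- kmeans(x, centers = 3, nstart = 25)

# Compare the cluster assignments of the two runs.
identical(km_a$cluster, km_b$cluster)
```

Raising nstart only makes agreement between runs more likely, whereas fixing the seed pins the result down for any setting.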
