
pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?

I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.

Ricardo Oliveros-Ramos
Fabian Fagerholm

2 Answers


This applies not only to cluster analysis: whenever randomness is involved, you can fix the seed of the random number generator so that you always get the same results.

Try:

set.seed(seed=123)
# your code here

The seed can be any integer, or any value that can be coerced to an integer. That's all there is to it.
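To make the effect concrete, here is a minimal sketch of the reseed-before-each-run pattern. The first part uses only base R. The second part applies the same pattern to pvclust and is guarded so it only runs if the package is installed; the choice of `mtcars` columns, the seed value, and `nboot = 100` are assumptions made purely for illustration, not settings from the question.

```r
# Reseeding before each run makes any RNG-dependent code repeatable.
set.seed(123)
draw1 <- sample(1:100, 5)
set.seed(123)
draw2 <- sample(1:100, 5)
identical(draw1, draw2)  # TRUE

# The same pattern applied to pvclust (runs only if the package is installed).
# pvclust clusters the columns of its input; the data here is illustrative.
if (requireNamespace("pvclust", quietly = TRUE)) {
  demo_data <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])

  set.seed(123)
  fit1 <- pvclust::pvclust(demo_data, nboot = 100)
  set.seed(123)
  fit2 <- pvclust::pvclust(demo_data, nboot = 100)

  # Compare the bootstrap-based AU p-values of the two runs.
  pvclust_same <- identical(fit1$edges$au, fit2$edges$au)
}
```

This covers the serial case. Recent pvclust versions also appear to document an `iseed` argument intended for seeding, including parallel runs, so it is worth checking `?pvclust` for the installed version before relying on `set.seed()` alone.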

Ricardo Oliveros-Ramos

I've only used k-means. There, I had to set the number of 'runs' or iterations to a higher value than the default to get the same clusters on consecutive runs.

lebatsnok
  • Good observation – with enough iterations, pvclust will exhaust the search space and always converge on a "final" solution. But with a large data set, that could take a very long time, and finding the right number of iterations is a process of trial and error. Setting the random seed produces the same result regardless of the number of iterations. – Fabian Fagerholm Jan 02 '14 at 07:11
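
The two approaches discussed above can be compared side by side with base R's `kmeans()`. This is only a sketch on made-up synthetic data: the group means, the seed values, and `nstart = 50` are arbitrary assumptions. The `nstart` argument asks `kmeans()` to try several random starts and keep the best one, which is presumably the 'runs' setting the answer refers to.

```r
# Three well-separated synthetic groups in two dimensions (made-up data).
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Option 1: many random starts; kmeans() keeps the best of the nstart attempts.
# This makes a stable result likely, but does not guarantee it.
fit_many <- kmeans(x, centers = 3, nstart = 50)

# Option 2: fix the seed, so even a single random start is repeatable.
set.seed(42)
fit_a <- kmeans(x, centers = 3, nstart = 1)
set.seed(42)
fit_b <- kmeans(x, centers = 3, nstart = 1)

identical(fit_a$cluster, fit_b$cluster)  # TRUE
```

Raising `nstart` improves the odds of landing on the same solution at the cost of extra computation, while fixing the seed makes the run repeatable at any setting.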