210

I have often seen the set.seed function called at the start of an R program. I know it is used for random number generation. Is there any specific need to set it?

Roland
Vignesh

7 Answers

304

The need is the possible desire for reproducible results, which may for example come from trying to debug your program, or of course from trying to redo what it does:

These two results we will "never" reproduce as I just asked for something "random":

R> sample(LETTERS, 5)
[1] "K" "N" "R" "Z" "G"
R> sample(LETTERS, 5)
[1] "L" "P" "J" "E" "D"

These two, however, are identical because I set the seed:

R> set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"
R> set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"
R> 

There is vast literature on all that; Wikipedia is a good start. In essence, these RNGs are called Pseudo Random Number Generators because they are in fact fully algorithmic: given the same seed, you get the same sequence. And that is a feature and not a bug.
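As a minimal sketch of "same seed, same sequence" (the variable names and the second seed, 2024, are arbitrary choices for illustration), one can compare the draws programmatically instead of by eye:

```r
# Same seed, same stream: the draws are identical.
set.seed(42)
first <- runif(5)
set.seed(42)
second <- runif(5)
same_seed_same_draws <- identical(first, second)

# A different seed (2024 is an arbitrary choice) starts a different stream.
set.seed(2024)
third <- runif(5)
different_seed_differs <- !identical(first, third)

print(same_seed_same_draws)
print(different_seed_differs)
```

Using `identical()` rather than printing the numbers makes the check independent of which values a given R version happens to produce.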

Dirk Eddelbuettel
  • 6
    Thanks Dirk, for such a nice example. That clears up 99% of it, but I still have a question: in your answer you used set.seed with 42 as the argument. Is there any particular reason for choosing this value? – Vignesh Nov 29 '12 at 07:57
  • 51
    For a normal RNG of decent quality, the value doesn't matter. "42" is a reference to a famous book; other people use their birthday or "123" or just "1". – Dirk Eddelbuettel Nov 30 '12 at 01:30
  • 8
    The `char2seed` function in the TeachingDemos package allows you to set the seed (or choose a seed to pass into `set.seed`) based on a character string. For example you could have students use their name as the seed then each student has a unique dataset but the instructor can also create the same datasets for grading. – Greg Snow Dec 06 '12 at 22:26
  • 8
    It is possible to rerun the same code with different seeds until you get the "best" result (I have done this for examples). To guard against accusations of doing this it is best to choose a seed that has some obvious meaning, either always the same seed, or the date, or I use `char2seed` and the last name of the principal investigator on a project. – Greg Snow Dec 06 '12 at 22:28
  • If/when I'm that concerned about somebody questioning my choice of a random seed, I first randomly generate a seed, then use it. Something like: `seed <- sample(.Machine$integer.max, size=1) ; seed ; set.seed(seed);`. It doesn't necessarily mitigate the type of accusation @GregSnow suggested (a concerted effort will still allow the researcher to "fix" the results), but at least it keeps me from always using the same seed. – r2evans May 02 '15 at 19:33
  • 9
    @DirkEddelbuettel seed value *can* matter for non-computational reasons, a friend of mine had problems with publishing his simulation-based results because the code started with `set.seed(666)` and the reviewers did not like the Devils seed in the code... – Tim Oct 23 '15 at 08:25
  • Can I ask a super dumb question here - if I set.seed(123) for a sample in my console, then shared that on RStudio along with the vector that is sampled - would you get the same result if you are using the same seed, even though it’s on a different machine? – hachiko Nov 13 '20 at 02:41
  • 1
    @hachiko That is actually a good and valid question. I think the answer is "possibly but not guaranteed". On similar hardware and with similar R versions it should. But generally the reproducibility from `set.seed(...)` is meant to ensure reproducibility on the same (or near identical) machine. Sometimes the software changes (R famously fixed a bug a release or two ago so the results differ _on the same machine_ between an "old" R version and the current one -- but you can set a toggle to reproduce old results). – Dirk Eddelbuettel Nov 13 '20 at 02:46
  • Oh wow that’s very helpful. I’ve seen stackoverflow posts that include the set.seed code and I was wondering if that were intentional or not. Another question - does my console have a long memory of the seed? Would I always want set.seed to appear right before sample? And that seed will only last for the next random function? Is “sample” the only function that works with seed, or are there others ? – hachiko Nov 13 '20 at 02:49
  • All functions accessing the R RNGs (there are several engines for uniform and normal) are affected by `set.seed()`, and no, it does not survive the session. _You_ however can record the seed you use to start each session or run and ensure reproducibility. The comments here are not a good way to go deeper -- please see `help(Random)` in R (terse as it is, it is precise) and see some more general tutorials. Good luck, you asked the right questions and you are on the right track. – Dirk Eddelbuettel Nov 13 '20 at 02:59
35

You have to set the seed every time you want to get a reproducible random result.

set.seed(1)
rnorm(4)
set.seed(1)
rnorm(4)
Chia-hung
20

Just adding some additional aspects. Need for setting the seed: in the academic world, if someone claims that their algorithm achieves, say, 98.05% performance in one simulation, others need to be able to reproduce it.

?set.seed

Going through the help file of this function, these are some interesting facts:

(1) set.seed() returns NULL, invisibly.

(2) "Initially, there is no seed; a new one is created from the current time and the process ID when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored." This is why you would call set.seed() with the same integer value the next time you want the same sequence of random numbers.
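Both facts can be checked directly. The sketch below is only an illustration (the variable names are made up here): it confirms the NULL return value and shows that the generator's state, which R keeps in `.Random.seed` in the global environment, can be saved and put back to replay a stream. This is the mechanism behind the "restored from a saved workspace" remark.

```r
# set.seed() is called for its side effect; its return value is NULL.
ret <- set.seed(1)
returns_null <- is.null(ret)

# The RNG state lives in .Random.seed in the global environment.
set.seed(1)
saved_state <- get(".Random.seed", envir = globalenv())
x <- runif(3)

# Put the saved state back and draw again: the stream replays.
assign(".Random.seed", saved_state, envir = globalenv())
y <- runif(3)
state_replays <- identical(x, y)

print(returns_null)
print(state_replays)
```

In everyday use, calling `set.seed()` with a recorded integer is simpler than handling `.Random.seed` by hand; the state vector is shown here only to make the mechanism visible.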

TobiMcNamobi
Ridingstar
8

Fixing the seed is essential when we try to optimize a function that involves randomly generated numbers (e.g. in simulation based estimation). Loosely speaking, if we do not fix the seed, the variation due to drawing different random numbers will likely cause the optimization algorithm to fail.

Suppose that, for some reason, you want to estimate the standard deviation (sd) of a mean-zero normal distribution by simulation, given a sample. This can be achieved by running a numerical optimization over the following steps:

  1. (Setting the seed)
  2. Given a value for sd, generate normally distributed data
  3. Evaluate the likelihood of your data given the simulated distributions

The following functions do this, once without step 1., once including it:

# without fixing the seed
simllh <- function(sd, y, Ns){
  simdist <- density(rnorm(Ns, mean = 0, sd = sd))
  llh <- sapply(y, function(x){ simdist$y[which.min((x - simdist$x)^2)] })
  return(-sum(log(llh)))
}
# same function with fixed seed
simllh.fix.seed <- function(sd,y,Ns){
  set.seed(48)
  simdist <- density(rnorm(Ns,mean=0,sd=sd))
  llh <- sapply(y,function(x){simdist$y[which.min((x-simdist$x)^2)]})
  return(-sum(log(llh)))
}

We can check the relative performance of the two functions in discovering the true parameter value with a short Monte Carlo study:

N <- 20; sd <- 2 # features of simulated data
est1 <- rep(NA,1000); est2 <- rep(NA,1000) # initialize the estimate stores
for (i in 1:1000) {
  # reseed from the clock's fractional second so each replication draws new data
  now <- as.numeric(Sys.time())
  set.seed((now - floor(now)) * 1e8)
  y <- rnorm(N, sd = sd) # generate the data
  est1[i] <- optim(1, simllh, y = y, Ns = 1000, lower = 0.01)$par
  est2[i] <- optim(1, simllh.fix.seed, y = y, Ns = 1000, lower = 0.01)$par
}
hist(est1)
hist(est2)

The resulting distributions of the parameter estimates are:

[Figure: histogram of parameter estimates without fixing the seed]
[Figure: histogram of parameter estimates fixing the seed]

When we fix the seed, the numerical search ends up close to the true parameter value of 2 far more often.
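One way to see why, as a rough sketch: without a fixed seed the objective returns a different value each time it is evaluated at the same point, so the optimizer is chasing noise. The function `sim_obj` below is a hypothetical stand-in written for this illustration, not the likelihood function from this answer, and the target variance of 4 is an assumption matching a true sd of 2.

```r
# Hypothetical objective, not the simllh functions above: squared gap between
# the variance of simulated draws and a target variance of 4.
sim_obj <- function(s, fix_seed = FALSE, n = 500) {
  if (fix_seed) set.seed(48)           # same underlying draws on every call
  draws <- rnorm(n, mean = 0, sd = s)
  (var(draws) - 4)^2
}

# Without a fixed seed, two evaluations at the same point disagree.
noisy <- c(sim_obj(2), sim_obj(2))
noisy_differs <- noisy[1] != noisy[2]

# With a fixed seed, the objective is an ordinary deterministic function.
fixed <- c(sim_obj(2, fix_seed = TRUE), sim_obj(2, fix_seed = TRUE))
fixed_agrees <- identical(fixed[1], fixed[2])

print(noisy_differs)
print(fixed_agrees)
```

Note that calling `set.seed()` inside a function resets the global generator as a side effect, which is why the Monte Carlo loop reseeds at the start of every replication.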

MS Berends
7

Basically, the set.seed() function lets us reuse the same set of random values, which we may need in the future to re-evaluate a particular task with the same random values.

We just need to call it before using any function that generates random numbers.

user4388407
1

set.seed is a base R function that, used together with other functions (rnorm, runif, sample), lets you generate the same random values every time you want.

Below is an example without a fixed seed (set.seed(NULL) re-initializes the generator as if no seed had been set):

> set.seed(NULL)
> rnorm(5)
[1]  1.5982677 -2.2572974  2.3057461  0.5935456  0.1143519
> rnorm(5)
[1]  0.15135371  0.20266228  0.95084266  0.09319339 -1.11049182
> set.seed(NULL)
> runif(5)
[1] 0.05697712 0.31892399 0.92547023 0.88360393 0.90015169
> runif(5)
[1] 0.09374559 0.64406494 0.65817582 0.30179009 0.19760375
> set.seed(NULL)
> sample(5)
[1] 5 4 3 1 2
> sample(5)
[1] 2 1 5 4 3

Below is an example with set.seed:

> set.seed(123)
> rnorm(5)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
> set.seed(123)
> rnorm(5)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
> set.seed(123)
> runif(5)
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
> set.seed(123)
> runif(5)
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
> set.seed(123)
> sample(5)
[1] 3 2 5 4 1
> set.seed(123)
> sample(5)
[1] 3 2 5 4 1
Earl Mascetti
1

Just to add further: you need to set the seed every time you do some random stuff if you want consistency. The seed doesn't remain set; each draw advances the generator's state.

set.seed(0)
rnorm(3)
set.seed(0)
rnorm(3)

[1]  1.2629543 -0.3262334  1.3297993
[1]  1.2629543 -0.3262334  1.3297993
set.seed(0)
rnorm(3)
rnorm(3)

[1]  1.2629543 -0.3262334  1.3297993
[1]  1.2724293  0.4146414 -1.5399500
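The second pair of outputs is not a fresh start but a continuation of the same stream. A small sketch (variable names are illustrative, and it assumes R's default generator settings) makes that explicit: six draws after one `set.seed(0)` equal two consecutive batches of three.

```r
# One call drawing six values after seeding.
set.seed(0)
all_six <- rnorm(6)

# Two consecutive calls of three values after the same seed.
set.seed(0)
first_three <- rnorm(3)
next_three <- rnorm(3)

# The stream simply continues where the previous call stopped.
stream_continues <- identical(all_six, c(first_three, next_three))

# Without reseeding, the second batch differs from the first.
batches_differ <- !identical(first_three, next_three)

print(stream_continues)
print(batches_differ)
```

So a single `set.seed()` at the top of a script is enough to make the whole run reproducible, as long as the sequence of random calls stays the same; reseeding is only needed to replay a particular draw.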
Harley