4

I've prepared a vector by sampling a log-normal distribution by (by trial and error) setting the parameters for mean and sd so rlnorm() returns exactly a mean of 20 and a sd of 6 (to 3 decimal places) for the any specified random set.seed() as per the following example...

# 10,000 samples from log-normal distribution

set.seed(7)
HcT <- rlnorm(n = 10000, log(19.147), log(1.33832))

# Report mean and sd

paste('The mean of HcT is',round(mean(HcT),3),'and the SD is',round(sd(HcT),3)) 

[1] "The mean of HcT is 20 and the SD is 6"

However, rather than trial and error I would like to 'goal seek' the two parameters. There are several stack overflow examples for single value goal seek but I'm not sure what function or package to apply for a two parameter case (mean and SD in the above sample).

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Markm0705
  • 1,340
  • 1
  • 13
  • 31

2 Answers2

4

It should work OK to minimize the sum of the squared deviations from the target values. There are pitfalls to this approach (see e.g. Numerical Recipes by Press et al.), but it should be OK for simple problems. The following code appears to retrieve the correct answers for your case:

f <- function(p,seed=7,target=c(20,6)) {
    mu <- log(p[1])
    sd <- log(p[2])
    set.seed(seed)
    r <- rlnorm(1e4,mu,sd)
    sum((c(mean(r),sd(r))-target)^2)
}

Choosing some non-ridiculous starting values ({15,2}):

optim(par=c(15,2), fn=f)

Based on @Cole's answer I would have thought this would work perfectly: draw normal deviates, transform them so they have a mean and sd exactly equal to the log-scale values, then exponentiate. But it only works on average or asymptotically (i.e., a large sample converges to the desired mean), not exactly for finite samples. Haven't thought through exactly why this is so.

rlnorm_exact <- function(n, m, sd) {
    m2 <-  log(m^2 / sqrt(sd^2 + m^2))
    sd2 <- sqrt(log(1 + (sd^2 / m^2)))
    r <- c(scale(rnorm(n)))
    return(exp(sd2*r+m2))
}
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Nice ... so if I set the output from optim(par=c(15,2), fn=f) to a variable say tmp then I can access the the parameters using tmp$par[1] for the mean and tmp$par[2] for the sd and feed them into the rlnorm() funciton – Markm0705 Jun 30 '19 at 11:39
  • It's not exact because each distribution provided by ```rlnorm()``` might be weighted slightly differently than a different call. In other words, a six-sided die has an average of 3.5. Sometimes you may get a string of 1's. Just because a sample mean is lower than 3.5 does not prevent the population from having a mean of 3.5. More to the point, if you return m2 and sd2, that would be equivalent of ```log(19.147)``` and ```log(1.33832)```. – Cole Jun 30 '19 at 19:03
  • yes, but that's what the `scale()` call does -- sets the mean and sd of the normal sample to exactly 0/1 ... – Ben Bolker Jul 01 '19 at 01:48
  • Ah, I see! I thought your edit was still returning mean and sd. Not sure what the OP prefers, but using ```rlnorm(n, m2, sd2)``` is a more random sample than always returning a sample with the same mean. – Cole Jul 01 '19 at 11:40
3

I ran into this problem before, too, and the link below set me straight. The rlnorm() isn't just simply using the log of the arithmatic mean and standard deviation. Instead, the function expects mu and sigma which are specific to the lognormal distribution.

Thankfully, the people at this link derived the formulas for us to transform to lognormal distributions.

I'm going to make this less pretty so people go to the link above as they solved this:

m <- 20
s <- 6

data_set <- rlnorm(n=1000000, 
                   meanlog=log(m^2 / sqrt(s^2 + m^2)), 
                   sdlog=sqrt(log(1 + (s^2 / m^2))))

mean(data_set)
sd(data_set)

Edit: changed variable from sd to s because sd() is also a function...

Cole
  • 11,130
  • 1
  • 9
  • 24
  • this is useful but doesn't quite do what the OP wanted, which is to get values that lead to *exact* matches with the target mean and sd ... – Ben Bolker Jun 30 '19 at 06:27
  • Based on his comment to your answer, I think this is exactly what OP wants - a lognormal distribution based on arithmetic mean and std. dev. So OP would end after the ```data_set``` call line. – Cole Jun 30 '19 at 12:16