
In trying to fit a lognormal distribution to my truncated data, I found and followed these two Stack Overflow posts:

  • Fitting a lognormal distribution to truncated data in R
  • Fitting a truncated lognormal distribution in R

However, this approach no longer works: the dtrunc and ptrunc functions from the truncdist package now fail the checks that fitdistrplus runs on user-supplied distribution functions.

library(fitdistrplus)
library(truncdist)

# Wrappers for the truncated lognormal in the d/p naming scheme fitdist() expects
dtruncated_log_normal <- function(x, a, b, meanlog, sdlog)
  dtrunc(x, "lnorm", a = a, b = b, meanlog = meanlog, sdlog = sdlog)

ptruncated_log_normal <- function(q, a, b, meanlog, sdlog)
  ptrunc(q, "lnorm", a = a, b = b, meanlog = meanlog, sdlog = sdlog)

fit <- fitdist(s, "truncated_log_normal",
               start = list(a = 0.001, b = 90,
                            meanlog = mean(log(s)), sdlog = sd(log(s))))

This trips essentially every one of those checks, returning:

Error in fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  : 
  the function mle failed to estimate the parameters, 
                with the error code 100
In addition: Warning messages:
1: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The dtruncated_log_normal function should return a vector of with NaN values when input has inconsistent values and not raise an error
2: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The ptruncated_log_normal function should return a vector of with NaN values when input has inconsistent parameters and not raise an error
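
The warnings are fitdist telling us what its checks do: they call the supplied d/p functions with deliberately inconsistent arguments and expect a vector of NaN back rather than an error. A minimal probe of that behavior (hypothetical arguments with a > b, wrapped in tryCatch so it runs either way):

library(truncdist)
tryCatch(
  dtrunc(1, "lnorm", a = 2, b = 1, meanlog = 0, sdlog = 1),  # a > b: inconsistent
  error = function(e) cat("dtrunc raised an error:", conditionMessage(e), "\n")
)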

Sample of my vector with over 2000 elements:

> dput(head(s,150))
c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

Is there a workaround for this in 2023? My data is already cleaned and contains no inconsistent, empty, NA, or otherwise strange values.

Sol1
  • You should include a short sample vector `s` and all necessary `library()` calls so we can run the code. – user2554330 Mar 19 '23 at 10:32
  • The warnings aren't the problem, the problem is the `error code 100` from the `mle()` function. Without your data, I can't guess what it means. – user2554330 Mar 19 '23 at 10:38
  • `library(fitdistrplus) library(truncdist) s <- c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824)` – Sol1 Mar 19 '23 at 10:44
  • You're trying to fit a truncated normal distribution from 6 data points? – Allan Cameron Mar 19 '23 at 11:40
  • Lognormal, and of course not, that's the head() of my vector with 2000 elements, it returns the exact same error as the full vector. – Sol1 Mar 19 '23 at 11:53
  • Could you perhaps include one or two hundred elements of your vector (editing your question rather than posting data necessary to run the example in the comments). This is easy to do using the `dput` function, e.g. `dput(head(s, 100))` – Allan Cameron Mar 19 '23 at 12:12
  • Yes, I've done it now. But I'm getting the error with basically any input vector I try, so I'm not sure that it has to do with the data. – Sol1 Mar 19 '23 at 12:36
  • I think you'll get more people helping if you make it easy to reproduce the example. Since you're a new contributor, I'll put all the bits and pieces together, but you would have gotten an answer faster if you'd done it yourself. – user2554330 Mar 19 '23 at 14:02

1 Answer


The first problem is the restriction on the parameters: 0 < a < min(s) < max(s) < b and 0 < sdlog are requirements of the model. The optimizer doesn't know that, and runs into trouble because values that violate those restrictions generate lots of errors.

One way to deal with problems like this is to transform the parameters so they are unrestricted. For example, you can optimize over log(a), so that a = exp(log(a)) is guaranteed to be positive; over log(b - a), to guarantee b > a; and over log(sdlog), to guarantee sdlog > 0.
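
A minimal sketch of that reparameterization (the helper names here are my own, nothing fitdistrplus requires):

# unrestricted scale -> natural scale
to_restricted <- function(loga, logbminusa, meanlog, logsdlog)
  list(a = exp(loga),                    # a > 0 by construction
       b = exp(loga) + exp(logbminusa),  # b > a by construction
       meanlog = meanlog,
       sdlog = exp(logsdlog))            # sdlog > 0 by construction

# natural scale -> unrestricted scale, e.g. for picking starting values
to_unrestricted <- function(a, b, meanlog, sdlog)
  list(loga = log(a), logbminusa = log(b - a),
       meanlog = meanlog, logsdlog = log(sdlog))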

The next problem is harder. For some of the values tried by the optimizer, the probability of the original distribution falling in the truncation interval evaluates to zero. Specifically, I saw this in debugging when a = 0.001, b = 90, sdlog = 0.009529981 (your starting values), and meanlog = 5.176146 (somewhat larger than the starting value).

The probability isn't really zero, it's some small value that rounds down to zero. The solution to this is to work with log probabilities instead of probabilities, but I don't think that option is available to you --- the code where this problem comes up is in truncdist, but the parameters are coming from fitdistrplus. Those two packages would need to work together to solve this, or perhaps you could write specialized versions of dtrunc and ptrunc to do it.
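
To see the underflow concretely at those parameter values (meanlog = 5.176146 puts essentially all of the lognormal's mass near exp(5.176146) ≈ 177, far above the interval [0.001, 90]):

plnorm(90,    meanlog = 5.176146, sdlog = 0.009529981)  # underflows to 0
plnorm(0.001, meanlog = 5.176146, sdlog = 0.009529981)  # 0 as well
plnorm(90, meanlog = 5.176146, sdlog = 0.009529981, log.p = TRUE)  # finite, about -2500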

EDITED TO ADD:

The basic idea used by dtrunc is that the truncated density equals the regular density divided by the probability of falling in the truncation interval, i.e. d/(pb - pa), where d is the untruncated density and pa and pb are the CDF values at the endpoints a and b. The numerical problem is that pb == pa due to rounding.

The solution to this is to rescale everything, and work on the log scale. That is, use (d/pb)/(1 - pa/pb) = exp(log(d) - log(pb) - log1p(-exp(log(pa) - log(pb)))).
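
A quick check that the rescaled form agrees with the naive one at benign parameter values where both are computable (a standard lognormal truncated to [1, 2], evaluated at x = 1.5):

d  <- dlnorm(1.5);  pa <- plnorm(1);  pb <- plnorm(2)
naive <- d / (pb - pa)
ld <- dlnorm(1.5, log = TRUE)
la <- plnorm(1, log.p = TRUE);  lb <- plnorm(2, log.p = TRUE)
stable <- exp(ld - lb - log1p(-exp(la - lb)))
all.equal(naive, stable)  # TRUE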

The code below does this, as well as solving the first problem. It doesn't use the truncdist package; it does the calculations using base functions.

And that's still not enough! Now the problem is that fitdist forces optim to compute a Hessian matrix, and numerical issues cause that to fail. There's a way around this: I'll define a "custom" optimization function that is just the regular optim() with the hessian argument forced to FALSE.

And here's the result. Sorry, no standard errors.

library(fitdistrplus) 
#> Loading required package: MASS
#> Loading required package: survival

s <- c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
       89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
       88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
       89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
       87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
       88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
       86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
       89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
       88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
       89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
       88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
       89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
       87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
       87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
       87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
       88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
       89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
       88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
       88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

dtruncated_log_normal <- function(x, loga, logbminusa, meanlog, logsdlog) {
  # Map the unrestricted parameters back to the natural scale
  a <- exp(loga)
  b <- exp(logbminusa) + a
  goodvals <- is.finite(x) & a <= x & x <= b
  sdlog <- exp(logsdlog)
  # debugging aid:
  # cat("parms:", c(min(x[goodvals]), max(x[goodvals]), loga, a, logbminusa, b, meanlog, sdlog), "\n")
  # log CDF at the endpoints: stays finite where plnorm() would underflow to 0
  logpab <- plnorm(c(a, b), meanlog, sdlog, log = TRUE)
  result <- rep(0, length(x))
  logd <- dlnorm(x[goodvals], meanlog, sdlog, log = TRUE)
  # stable evaluation of d/(pb - pa) on the log scale, as derived above
  result[goodvals] <- exp(logd - logpab[2] - log1p(-exp(logpab[1] - logpab[2])))
  result
}

# The MLE only uses the density; a stub CDF returning NaN satisfies fitdist's checks
ptruncated_log_normal <- function(q, loga, logbminusa, meanlog, logsdlog) rep(NaN, length(q))

# Ignore the hessian argument fitdist passes and force hessian = FALSE
myoptim <- function(..., hessian) {
  optim(..., hessian = FALSE)
}

fit <- fitdist(s, "truncated_log_normal",
               start = list(loga = log(min(s) - 1),
                            logbminusa = log(max(s) - min(s) + 2),
                            meanlog = mean(log(s)),
                            logsdlog = log(sd(log(s)))),
               custom.optim = myoptim)

fit
#> Fitting of the distribution ' truncated_log_normal ' by maximum likelihood 
#> Parameters:
#>             estimate Std. Error
#> loga        4.439900         NA
#> logbminusa  1.654508         NA
#> meanlog     4.490567         NA
#> logsdlog   -4.354114         NA

# Convert back to the original scale:
ests <- as.list(fit$estimate)
with(ests, {   
  a <- exp(loga)
  b <- exp(logbminusa) + a
  sdlog <- exp(logsdlog)
  c(a = a, b = b, meanlog = meanlog, sdlog = sdlog)
  })
#>           a           b     meanlog       sdlog 
#> 84.76649227 89.99700006  4.49056699  0.01285382

Created on 2023-03-19 with reprex v2.0.2
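
As a quick visual check of the result, the fitted density can be overlaid on a histogram of the data, reusing dtruncated_log_normal() from above:

pars <- fit$estimate
xs <- seq(min(s), max(s), length.out = 200)
dens <- dtruncated_log_normal(xs, pars["loga"], pars["logbminusa"],
                              pars["meanlog"], pars["logsdlog"])
hist(s, freq = FALSE, breaks = 30, main = "Truncated lognormal fit")
lines(xs, dens, lwd = 2)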

user2554330
  • Thanks. I didn't realize it was this complicated, I'll build off this if I can't find any simpler solution. What is it that makes my problem harder than the ones in the previous questions that I linked to in my original post? I thought I was dealing with exactly the same model as theirs, I just need to fit some truncated data to a lognormal model. – Sol1 Mar 19 '23 at 15:57
  • Truncated models are hard. I think they just got lucky; you didn't. – user2554330 Mar 19 '23 at 16:44
  • Aha, okay. I didn't realize it was so tricky, it's my first time working with truncated data. I'll have to have a close look at the literature, thanks! – Sol1 Mar 19 '23 at 16:57