7

I have a simple x, y data.frame.

mydata <- data.frame(days = 1:96, risk = c(5e-09, 5e-09, 5e-09, 1e-08, 4e-08, 6e-08, 9e-08, 1.5e-07, 4.2e-07, 
                                           7.2e-07, 1.02e-06, 1.32e-06, 1.66e-06, 2.19e-06, 2.76e-06, 3.32e-06, 
                                           3.89e-06, 4.55e-06, 5.8e-06, 7.16e-06, 8.51e-06, 9.85e-06, 1.138e-05, 
                                           1.396e-05, 1.672e-05, 1.947e-05, 2.222e-05, 2.521e-05, 2.968e-05, 
                                           3.439e-05, 3.909e-05, 4.378e-05, 4.894e-05, 5.697e-05, 6.546e-05, 
                                           7.392e-05, 8.236e-05, 9.16e-05, 0.00010573, 0.00012063, 0.00013547, 
                                           0.00015025, 0.00016642, 0.00019127, 0.00021743, 0.00024343, 0.00026924, 
                                           0.00029818, 0.00034681, 0.00039832, 0.00044932, 0.00049976, 0.0005451, 
                                           0.00056293, 0.00057586, 0.00058838, 0.0006005, 0.00061562, 0.00065079, 
                                           0.00068845, 0.00072508, 0.00076062, 0.00079763, 0.00084886, 0.00090081, 
                                           0.0009507, 0.00099844, 0.00104427, 0.00108948, 0.00113175, 0.00117056, 
                                           0.00120576, 0.00123701, 0.00126253, 0.00128269, 0.00129757, 0.00130716, 
                                           0.00131291, 0.00132079, 0.0013216, 0.00131392, 0.00129806, 0.00127247, 
                                           0.00122689, 0.00117065, 0.00110696, 0.00103735, 0.00095951, 0.00085668, 
                                           0.0007517, 0.00065083, 0.000556, 0.0004669, 0.00037675, 0.00029625, 
                                           0.00093289))

I think Weibull(3, 0.155) is a pretty good fit to my data, judging from the plot below.

plot(1:96, dweibull(mydata$risk, shape = 3, scale = 0.155), type = "l", xlab = "days", ylab = "risk")
lines(mydata, type = "l", col = "grey")
legend("topleft", c("Data", "Estimate"), col = c("black", "grey"), lty = c(1, 1))

enter image description here

I write a function that calculates the negative log-likelihood, which will be passed into mle.

estimate <- function(kappa, lambda){
  -sum(dweibull(mydata$y, shape = kappa, scale = lambda, log = TRUE))
}

I call mle, provide my initial parameter estimates, and obtain the following error.

> mle(estimate, start = list(kappa = 3, lambda = 0.155))
Error in optim(start, f, method = method, hessian = TRUE, ...) : 
  non-finite finite-difference value [2]
In addition: There were 50 or more warnings (use warnings() to see the first 50)

What went wrong here?

Adrian
  • 9,229
  • 24
  • 74
  • 132

1 Answers1

2

What do you want to do? From what I can tell, you have a dataset of 96 values of "risk", and you want to fit it's distribution with weibull. Note that "days" is not relevant at all if this is the case. You have an unordered vector of values.

The figure above is misleading. You calculate dweibull() for risk values. The figure indicates that dweibull(risk) is roughly equal to risk. This is a rather different claim than weibull with given parameters being a good fit.

for instance, here is the distribution of your data: hist(mydata$risk, breaks=15) enter image description here while the weibull density with your parameters in the relevant range looks like this: curve((function(x) dweibull(x, shape=3, scale=0.155))(x), 0, 0.0014) enter image description here

Hence these distributions are very different. I would say your empirical distributions is uniform plus mass at zero, not weibull.

Now to your last problem: as the distributions do not fit well, the optimizer runs into numerical singularities. I don't know mle() too well, but with a little tweaks maxLik::maxLik() will show the problem:

estimate <- function(par){
   Kappa <- par[1]
   Lambda <- par[2]
   dweibull(mydata$risk, shape = Kappa, scale = Lambda, log = TRUE)
}
summary(maxLik::maxLik(estimate, start=c(Kappa=3, Lambda=0.155), method="BHHH"))

gives you

--------------------------------------------
Maximum Likelihood estimation
BHHH maximisation, 43 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: 682.743 
2  free parameters
Estimates:
        Estimate Std. error t value Pr(> t)    
Kappa  0.4849129  0.0473720  10.236 < 2e-16 ***
Lambda 0.0002953  0.0001028   2.873 0.00407 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------

Note that I did one major change: removing sum from your log-likelihood, and using BHHH optimizer. This typically more stable than optimizing based on a single summed likelihood. You should also seriously consider writing analytic derivatives for the estimation.

You can check that the distributions look a lot more similar now.

Ott Toomet
  • 1,894
  • 15
  • 25
  • Thank you. Regarding your last comment, how can I check that the distributions look a lot more similar now? I tried `plot(dweibull(1:95, shape = 0.4838894, scale = 0.0002961))` but it looks different from the distribution of my data? – Adrian Oct 02 '17 at 14:35
  • You can start with something simple like qqplot. You can also see how similar are the moments and other characteristics of these two distributions. And finally, you can calculate some sort of distance between these two distributions--for instance, Kulback-Leibler distance (which is the same as log-likelihood). The latter will also work for certain tests. – Ott Toomet Oct 02 '17 at 14:40
  • Is there a quick graphical way to see if the two distributions are similar? Along the lines of plotting the densities. – Adrian Oct 02 '17 at 14:57
  • `fitdistrplus::plotdist(risk, "weibull", list(shape=0.4838894, scale=0.0002961))` is doing quite a good job in my opinion. (`risk` is your risk variable). – Ott Toomet Oct 02 '17 at 15:08
  • Thank you. I would like to clarify that `mydata` actually contains densities (I do not have the raw data), so I think my first plot (where I compared weibull densities to `mydata$risk`) shows a good fit. In your response we're technically looking at the densities of `mydata$risk`, i.e. densities of densities...correct? – Adrian Oct 02 '17 at 15:13
  • Yes. I assumed `mydata` contains your random variables, not densities... – Ott Toomet Oct 02 '17 at 15:15
  • That's why I'm confused as to why plotting `plot(dweibull(1:95, shape = 0.4838894, scale = 0.0002961))` looks very different from `lines(mydata$risk, col = "grey)` – Adrian Oct 02 '17 at 15:15
  • I see. Sorry for the confusion. – Adrian Oct 02 '17 at 15:16
  • Indeed, sorry for that. Here you may consider using NLLS for estimation. – Ott Toomet Oct 02 '17 at 15:42