2

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.

    degree_days <-   c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                      1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                      1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                      2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                      2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                      2707.36,2773.82,2816.39,2863.94)
    emergence <-  c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
                   0,0,0,0,1,0,0,0,0,0)
    cum_em <- cumsum(emergence)
    df_dd_em <- data.frame (degree_days, emergence, cum_em)
    df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
    df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
    x <- pweibull(df_dd_em[c(1,3)],shape=5)
    dframe2.mle <- fitdist(x, "weibull",method='mle')
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Sara E
  • 31
  • 5
  • `pweibull` doesn't take a data.frame as a parameter. What are you trying to do when you call that function? Are you trying to specify both `q` and `scale`? Even with a three-parameter model, you only have one random variable. What is your random variable exactly? – MrFlick Jul 22 '14 at 19:33
  • The random variable is emergence, but I want the model to understand emergence as a function of degree days (time) – Sara E Jul 22 '14 at 19:59
  • also scale is automatically set to 1 – Sara E Jul 22 '14 at 20:02
  • Well, a distribution is generally a univariate property; certainly for the weibull case at least. If you have two inputs, it doesn't make sense to do distribution fitting for both values simultaneously. `fitdist` seems completely inappropriate. If I believe height is normally distributed for all 25 year olds. It doesn't matter on what day I measure their height; the height's will wither be normally distributed or not. There is a single variable here at play. Now if i'm measuring a person's height over time, i'm not going to "fit a distribution" to that curve. I'd do a regression of some sort. – MrFlick Jul 22 '14 at 20:16

2 Answers2

3

Here's my best guess at what you're after:

Set up data:

dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                      1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                      1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                      2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                      2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                      2707.36,2773.82,2816.39,2863.94),
                 emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
                 0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))

We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).

library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull(c(-Inf,x),      ## or (c(x,Inf))
             shape=shape,scale=scale)),1e-6)
    ## multinomial probability
    -sum(y_log_p(y,prob))
}    
library(bbmle)

I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:

## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))

It's important to use parscale to let R know that the parameters are on very different scales:

startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
     control=list(parscale=unlist(startvals)))

Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:

library(FAdist)
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
                 1e-6)
    ## multinomial probability
    -sum(y_log_p(y,prob))
}    
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
     control=list(parscale=unlist(startvals2)))

Looks like the three-parameter fit is much better:

library(emdbook)
AICtab(m1,m2)
##    dAIC df
## m2  0.0 3 
## m1 21.7 2 

And here's the graphical summary:

with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
                             pweibull(x,shape=shape,scale=scale),col=2,
                             add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
                             pweibull3(x,shape=shape,
                                       scale=scale,thres=thres),col=4,
                             add=TRUE))

enter image description here

(could also do this more elegantly with ggplot2 ...)

  • These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
  • Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
  • Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:

 confint(m1)
 ##             2.5 %      97.5 %
 ## scale 1576.685652 1777.437283
 ## shape    4.223867    6.318481
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thank you @Ben, but where do the -Inf and 1e-6 come from in the model fittings? I actually need to fit the model in terms of cumulative percent throughout the season so I can plot over multiple years. When I try to alter this model to use the cumulative percent the curve doesn't fit- it goes off the plot rather than matching the data points. 'code' NLLfun2 <- function(scale,shape,thres, x=dd$degree_days,y=dd$cum_per) { prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)), 1e-6) ## multinomial probability -sum(y_log_p(y,prob)) } – Sara E Jul 23 '14 at 03:19
  • I think you're misunderstanding; my code *does* in effect fit the cumulative percentage (see the graph?) -- it just does it in practice by comparing the emergence by period, which is a more plausible statistical model. The `-Inf` is there to signify the beginning point of the first period (0 would probably work just as well); the 1e-6 is there to make sure we don't get emergence probabilities of exactly zero, which will mess up the fitting process (if your starting values are good enough you probably don't need that bit) – Ben Bolker Jul 23 '14 at 13:27
0
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                           1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                           1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                           2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                           2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                           2707.36,2773.82,2816.39,2863.94),
             emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
                         0,0,0,0,1,0,0,0,0,0))

dd$cum_em <- cumsum(dd$emergence)

dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)

dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)

dd <- transform(dd)


#start 3 parameter model

library(FAdist)

## y*log(p) allowing for 0/0 occurrences:

y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))

NLLfun2 <- function(scale,shape,thres,
                x=dd$degree_days,y=dd$percent) {
  prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
           1e-6)
   ## multinomial probability
  -sum(y_log_p(y,prob))
} 

startvals2 <- list(scale=1000,shape=1,thres=100)

m2 <- mle2(NLLfun2,start=startvals2,
       control=list(parscale=unlist(startvals2)))

summary(m2)

#graphical summary

windows(5,5)

with(dd,plot(cum_per~degree_days,cex=3))

with(as.list(coef(m2)),curve(sum(dd$percent)*
                           pweibull3(x,shape=shape,
                                     scale=scale,thres=thres),col=4,
                         add=TRUE))

enter image description here

Sara E
  • 31
  • 5
  • ?? can you explain how this is something other than a slightly modified version of my answer ?? – Ben Bolker Jul 24 '14 at 01:03
  • It is the same as yours. I'm sorry, i'm new to the site. I just thought I was supposed to post exactly what I had been trying to plot to close the question. I Appologize if this was not the correct structure to close the question. – Sara E Jul 24 '14 at 04:59
  • The idea is that if you think my answer solves your problem, you can click on the check mark next to my answer. That signals that the problem has been answered satisfactorily. You are welcome to post your own answer to the problem (and accept it!) if you find a different/better solution, but it doesn't really make sense to do so if it's 97% redundant with my answer ... – Ben Bolker Jul 24 '14 at 05:09
  • Done and done. I'm sorry for my confusion on proper procedure. Thanks for your help. – Sara E Jul 24 '14 at 05:28