Using optimize() to find the shortest interval that takes 95% area under a curve in R

Question

Background:

I have a curve whose Y-values are produced by my small R function below (neatly annotated). If you run my entire R code, you see my curve (but remember, it's a function so if I changed the argument values, I could get a different curve):

Question:

Obviously, one can determine/assume many intervals that would cover/take 95% of the total area under this curve. But using, optimize(), how can I find the SHORTEST (in x-value units) of these many possible 95% intervals? What then would be the corresponding x-values for the the two ends of this shortest 95% interval?

Note: The idea of shortest interval for a uni-modal curve like mine makes sense. In reality, the shortest one would be the one that tends to be toward the middle where the height (y-value) is larger, so then x-value doesn't need to be so large for the intended interval to cover/take 95% of the total area under the curve.

Here is my R code (please run the entire code):

ppp <- function(f, N, df1, df2, petasq, alpha, beta) {

 pp <- function(petasq) dbeta(petasq, alpha, beta)
 ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )

 marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]

po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )

}
## @@@ END OF MY R FUNCTION.

# Now I use my function above to get the y-values for my plot:

petasq  <- seq(0, 1, by = .0001) ## These are X-values for my plot
f  <- 30       # a function needed argument
df1 <- 3       # a function needed argument
df2 <- 108     # a function needed argument
N  <- 120      # a function needed argument
alpha = 5      # a function needed argument
beta = 4       # a function needed argument


## Now use the ppp() function to get the Y-values for the X-value range above:
y.values <- ppp(f, N, df1, df2, petasq, alpha, beta)

## Finally plot petasq (as X-values) against the Y.values:
plot(petasq, y.values, ty="l", lwd = 3 )

score 2 · Answer 1 · answered Apr 07 '17 at 20:57

If we think of this as trying to calculate the interval with the smallest area, we can start calculating the areas of each of the regions we are plotting. We can then find the largest area (which presumably will be near the center) and start walking out till we found the area we are looking for.

Since you've already calculate the x and y values for the plot, i'll reuse those to save some calculations. Here's an implementation of that algorithm

pseduoarea <- function(x, y, target=.95) {
  dx <- diff(x)
  areas <- dx * .5 * (head(y,-1) + tail(y, -1))
  peak <- which.max(areas)
  range <- c(peak, peak)
  found <- areas[peak]
  while(found < target) {
    if(areas[range[1]-1] > areas[range[2]+1]) {
      range[1] <- range[1]-1
      found <- found + areas[range[1]-1]
    } else {
      range[2] <- range[2]+1
      found <- found + areas[range[2]+1]
    }   
  }
  val<-x[range]
  attr(val, "indexes")<-range
  attr(val, "area")<-found
  return(val)
}

And we call it with

pseduoarea(petasq, y.values)
# [1] 0.3194 0.5413

This does assume that all the values in petasq are equally spaced

MrFlick, is it possible that your function above is modified so that it can also find the 95% intervals in `half-curves` [i.e., truncated distributions]? — rnorouzian, Dec 23 '17 at 05:00

thc · Accepted Answer · 2017-04-08T08:52:04.877

Based on your revised question, I found the optimization that minimizes the SHORTEST distance (in x-value units) between LEFT and RIGHT boundaries:

ppp <- function(petasq, f, N, df1, df2, alpha, beta) {

 pp <- function(petasq) dbeta(petasq, alpha, beta)
 ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )

 marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]

po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )
}

petasq  <- seq(0, 1, by = .0001) ## These are X-values for my plot
f  <- 30       # a function needed argument
df1 <- 3       # a function needed argument
df2 <- 108     # a function needed argument
N  <- 120      # a function needed argument
alpha = 5      # a function needed argument
beta = 4       # a function needed argument

optim_func <- function(x_left) {
    int_function <- function(petasq) {
        ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
    }

    # For every LEFT value, find the corresponding RIGHT value that gives 95% area.  

    find_95_right <- function(x_right) {
        (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
    }
    x_right_obj <- optimize(f=find_95_right, interval=c(0.5,1))
    if(x_right_obj$objective > .Machine$double.eps^0.25) return(100)

    #Return the DISTANCE BETWEEN LEFT AND RIGHT
    return(x_right_obj$minimum - x_left)
}

#MINIMIZE THE DISTANCE BETWEEN LEFT AND RIGHT
x_left <- optimize(f=optim_func, interval=c(0.30,0.40))$minimum
find_95_right <- function(x_right) {
    (0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
}
    int_function <- function(petasq) {
        ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
    }
x_right <- optimize(f=find_95_right, interval=c(0.5,1))$minimum

See the comments in the code. Hopefully this finally satisfies your question :) Results:

> x_right
[1] 0.5409488
> x_left
[1] 0.3201584

Also, you can plot the distance between LEFT and RIGHT as a function of the left boundary:

left_x_values <- seq(0.30, 0.335, 0.0001)
DISTANCE <- sapply(left_x_values, optim_func)

plot(left_x_values, DISTANCE, type="l")

I get the following error after running your last line: `> x_right <- optimize(f=find_95_right, interval=c(0.5,1))$minimum Error in match.fun(f) : object 'int_function' not found Called from: match.fun(f)` — rnorouzian, Apr 08 '17 at 04:08
thc, I truly (very very much) appreciate your response, and expertise. Please allow me to test this thoroughly to see if it gives the shortest 95% interval of all. And I will let you know. Thank you so very much. — rnorouzian, Apr 08 '17 at 13:53
Oh gosh!, thc, your method produces completely wrong results! Just change `f` from `30` to `f = 5` to see how erroneous the results will be!!! — rnorouzian, Apr 08 '17 at 15:31
The search interval is set from 0.3 to 0.4. It actually works perfectly.... Change the search interval... don't assume everything will just be spoonfed. — thc, Apr 09 '17 at 06:24
Do please accept my apologies for not being completely familiar with `optimize()` in R, so, the entire range of support (x-axis values) is bound between 0 and 1. And the "Peak" of the curve can be anywhere. So, for the left_x and right_x to be always (no matter how the input values change), how should I best set the interval for x_left and x_right? — rnorouzian, Apr 09 '17 at 14:22
In this line: x_left <- optimize(f=optim_func, interval=c(0.30,0.40))$minimum, change the interval to approx. where the LEFT x value will be. — thc, Apr 09 '17 at 22:05
Oh, if this is to be each time manually adjusted, then it would be very difficult to use it in a dynamically changing process where input values could be anything. — rnorouzian, Apr 09 '17 at 22:22

IRTFM · Answer 3 · 2017-04-08T20:11:39.680

0

I don't think you need to use optimize (unless this were part of an unadmitted homework assignment). Instead just normalize a cumulative sum and figure out at which points your criteria are met:

> which(cusm.y >= 0.025)[1]
[1] 3163
> which(cusm.y >= 0.975)[1]
[1] 5375

You can check that these are reasonable indices to use for the pulling values from the petasq vector with:

abline( v= c( petasq[  c( which(cusm.y >= 0.025)[1], which(cusm.y >= 0.975)[1])]),
        col="red")

This is admittedly equivalent to constructing an integration function with a normalization constant across the domain of the "density" function. The fact that the intervals are all of equal dimension allows omitting the differencing of "x"-vector from the height times base calculation.

I suppose there is another interpretation possible. That would require that we discover how many values of an ascending-sorted version of petasq are needed to sum to 95% of the total sum. This gives a different strategy and the plot shows where a horizontal line would intersect the curve:

which( cumsum( sort( y.values, decreasing=TRUE) ) > 0.95* sum(y.values, na.rm=TRUE) )[1]
#[1] 2208
sort( y.values, decreasing=TRUE)[2208]
#[1] 1.059978
png()
  plot(petasq, y.values, ty="l", lwd = 3 )
  abline( h=sort( y.values, decreasing=TRUE)[2208], col="blue")
dev.off()

To get the petasq values you would need to determine the first y.values that exceeded that value and then the next y.values that dropped below that level. These can be obtained via:

order(y.values, decreasing=TRUE)[2208]
#[1] 3202
order(y.values, decreasing=TRUE)[2209]
#[1] 5410

And then the plot would look like:

png(); plot(petasq, y.values, ty="l", lwd = 3 )
      abline( v=  petasq[  c(3202, 5410)], col="blue", lty=3, lwd=2)
dev.off()

The area between the two dotted blue lines is 95% of the total area above the zero line:

edited Apr 08 '17 at 20:11

answered Apr 07 '17 at 21:23

IRTFM

258,963
21
364
487

Thanks for the input! Not a homework assignment or ANYTHING related to that! – rnorouzian Apr 07 '17 at 21:24
@parvin karimi. Do you care to explain what you find "wrong" about it? – IRTFM Apr 07 '17 at 21:28
Yes. That's what I _thought_ was requested. If something different was desired, then you should edit your question to make the request more clear. You have still not upvoted any answer, much less checkmarked one and I'm not the first person to apparently misunderstand your Q. – IRTFM Apr 07 '17 at 21:41
1

Please [edit] your question. "Highest density part" has been found to be ambiguous to several of us. You need to become more mathematically precise in your request. – IRTFM Apr 07 '17 at 21:44
(Perhaps) I now understand. – IRTFM Apr 07 '17 at 22:02
-42, could you please explain about your code in the yellow-box? It doesn't run the way it appears in the yellow box, I mean: `order(y.values, decreasing=TRUE)2208 3202 order(y.values, decreasing=TRUE)2209 5410` – rnorouzian Apr 08 '17 at 16:52
I just forgot to indent a code block. I cleaned it up. – IRTFM Apr 08 '17 at 20:10
so your SECOND strategy is that you create a horizontal line and starting from the top of the curve, bring the horizontal line down up until the area above the horizontal line has covered 95% of the total area of the curve? Is my understanding of your second strategy correct? – rnorouzian Apr 25 '17 at 00:06

Using optimize() to find the shortest interval that takes 95% area under a curve in R

Question:

Here is my R code (please run the entire code):

3 Answers3