
I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.

I then compute the area under the curve to derive percentages.

I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard coded input).

Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?

I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. ogive). However, I can't determine how to derive a % given this cumulative frequency graph.

What I don't want is an ecdf. I know how to do that, and it is not quite what I am trying to capture. Rather, I want to integrate an ogive between two interval endpoints.

thistleknot
  • It would be easier to give specific advice if you provide some sort of [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick Apr 29 '16 at 19:13
  • Despite your protestations to the contrary I think that you are attempting to reinvent the `ecdf` function as an intermediate to your goal. The end result would just be `ecdf(dat)(point2) - ecdf(dat)(point1)` – IRTFM Apr 29 '16 at 19:23

1 Answer

I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate. More generally, it works for any case where you have y values at equally spaced x values, and it can be generalized to variable x intervals as well:

```r
library(zoo)

# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)

# How to extract the x and y values of the density estimate
#dens$y
#dens$x

# x interval
dx = median(diff(dens$x))

# mean height for each pair of y values
h = rollmean(dens$y, 2)

# Area under curve
sum(h*dx)  # 1.000943

# Cumulative area
# cumsum(h*dx)

# Plot density, showing points at which density is calculated 
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
```

[Figure: kernel density estimate, with vertical lines at the x-values where the density is calculated]

```r
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
```

[Figure: cumulative area under the density curve, with vertical lines at the mid-point of each x-interval]
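To get the area between two specific x values, one option is to turn the cumulative area into a function by linear interpolation. The following is only a sketch under some assumptions: it uses base R instead of `zoo`, a finer grid (`n = 512`), and rescales the cumulative area so it ends at exactly 1. The names `kde_cdf`, `cum_area`, `a`, and `b` are made up for illustration.

```r
set.seed(67839)
x <- c(rnorm(500, 5, 2), rnorm(200, 20, 3))
dens <- density(x, n = 512)

# Trapezoid rule: interval widths times mean height of each adjacent pair
dx <- diff(dens$x)
h <- (head(dens$y, -1) + tail(dens$y, -1)) / 2

# Cumulative area at each grid point, starting at 0 on the left edge
cum_area <- c(0, cumsum(h * dx))

# Rescale so the total area is exactly 1 (it is only approximately 1 otherwise)
cum_area <- cum_area / cum_area[length(cum_area)]

# Interpolated CDF of the kernel density estimate:
# 0 to the left of the grid, 1 to the right of it
kde_cdf <- approxfun(dens$x, cum_area, yleft = 0, yright = 1)

# Estimated proportion of the distribution between a and b
a <- 0
b <- 10
kde_cdf(b) - kde_cdf(a)
```

Because the grid returned by `density` extends beyond the range of the data, this curve keeps the smoothed tails of the kernel estimate rather than stopping at the observed minimum and maximum.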

UPDATE to include ecdf function

To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that it has the same shape as the cumulative-area plot above. The second is the ECDF of a plain-vanilla standard normal sample (mean = 0, sd = 1).

```r
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)

plot(ecdf(rnorm(1000)))
```

[Figures: ECDF of the normal mixture, and ECDF of a standard normal sample]
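For a percentage between two points, the ECDF can be evaluated at both points and differenced, as suggested in the comments on the question. An ogive (cumulative relative frequency at the bin edges, joined by straight lines) can be handled the same way. This is a rough sketch, assuming histogram bins from `hist` with `breaks = 20` and linear interpolation between bin edges; `ogive`, `a`, and `b` are illustrative names, not part of any package.

```r
set.seed(67839)
x <- c(rnorm(500, 5, 2), rnorm(200, 20, 3))
a <- 0
b <- 10

# ECDF: proportion of observations in (a, b]
Fn <- ecdf(x)
p_ecdf <- Fn(b) - Fn(a)

# Ogive: cumulative relative frequency at each histogram bin edge
hh <- hist(x, breaks = 20, plot = FALSE)
cum_freq <- c(0, cumsum(hh$counts)) / length(x)

# Linear interpolation between bin edges gives a function of x
ogive <- approxfun(hh$breaks, cum_freq, yleft = 0, yright = 1)
p_ogive <- ogive(b) - ogive(a)

c(ecdf = p_ecdf, ogive = p_ogive)
```

The two numbers should be close. Neither curve extends past the observed data range the way a kernel density estimate does.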

eipi10
  • Thank you. Is there also a way to do this with the density plot itself? I'm not sure if I need that, but figured I'd ask. – thistleknot Apr 29 '16 at 22:39
  • As @42- mentioned, it sounds like you want a plot of the cumulative density: `plot(ecdf(rnorm(1000)))`. – eipi10 Apr 29 '16 at 22:44
  • But an ecdf always results in a 45 degree angle when done cumulatively. That never looks like a cdf; in your picture here it looks like an (albeit weird) cdf. And again, an ecdf won't extend the edges as is done with a density plot (which uses kernel density estimation). – thistleknot Apr 29 '16 at 22:48
  • That's the shape of the ecdf of a normal distribution. The plot does go to the edges. `rnorm(1000)` will give values ranging (very roughly) from about -3.5 to 3.5. If you plot the mixture of normals I used in my answer, then you get the shape in my second plot. See updated answer. – eipi10 Apr 29 '16 at 23:00
  • Thank you very much. That ecdf doesn't look like a 45 degree line at all! – thistleknot Apr 29 '16 at 23:07
  • Okay, after figuring out how to install zoo, I was able to replicate most of the setup. I imagine I'm supposed to divide something by `sum(h*dx)`, but I'm not sure how to derive the area under the curve for a specific x value. – thistleknot May 03 '16 at 17:01
  • I think I get it. The second graph is the area under the curve of the first graph. Now I want to do the area under the curve of an ecdf. – thistleknot May 03 '16 at 22:17