I am using the package ks for kernel density estimation. Here's an easy example:

n <- 70
x <- rnorm(n)

library(ks)
f_kde <- kde(x) 

I am actually interested in the exceedance probabilities of my input data, which ks can easily return given f_kde:

p_kde <- pkde(x, f_kde)

ks does this with numerical integration using Simpson's rule. Unfortunately, this is only implemented for the 1d case; for the bivariate case, ks provides no method for returning these probabilities:

y <- rnorm(n)
f_kde <- kde(data.frame(x,y))
# does not work, but it's what I am looking for:
p_kde <- pkde(data.frame(x,y), f_kde) 

I couldn't find any package, or any help on Stack Overflow, that solves this issue in R (some suggestions exist for Python, but I would like to stay in R). Any line of code or package recommendation is appreciated. Even though I am mostly interested in the bivariate case, ideas for the general multivariate case are welcome as well.

  • Can't you just do a double integral like in [this previous question](https://stackoverflow.com/q/43189512/4752675)? – G5W Jul 18 '20 at 15:33
  • @G5W I already realized that `pracma::simpson2d` does what I am looking for, but it requires a function as input, and I don't know how to code a bivariate kernel density myself or where to start. – Felix Phl Jul 19 '20 at 16:48
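
Following up on that comment, here is a minimal sketch of how the bivariate kernel density could be hand-coded and passed to an integrator such as pracma::simpson2d. This is only an illustration, not part of the original exchange; it assumes the ks fit stores the data in fit$x and the bandwidth matrix in fit$H, that simpson2d evaluates its integrand on a vector or matrix grid, and the helpers dkde2d and f are made-up names:

library(ks)
library(mvtnorm)   # dmvnorm() for the Gaussian kernel
library(pracma)    # simpson2d() for the double integral

set.seed(1)
n <- 70
xy <- cbind(rnorm(n), rnorm(n))
fit <- kde(xy)

# a bivariate Gaussian KDE is a mixture of normals centred at the data points,
# each with covariance given by the fitted bandwidth matrix fit$H
dkde2d <- function(px, py) {
  mean(dmvnorm(fit$x, mean = c(px, py), sigma = fit$H))
}

# sanity check: should be close to ks's own estimate at the same point
dkde2d(0, 0)
kde(xy, eval.points = cbind(0, 0))$estimate

# wrapper that accepts vector or matrix grids, as numerical integrators expect
f <- function(px, py) {
  vals <- mapply(dkde2d, px, py)
  if (!is.null(dim(px))) dim(vals) <- dim(px)
  vals
}

# approximate P(X <= 1, Y <= 2), integrating from a lower bound far in the tails
simpson2d(f, -10, 1, -10, 2)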

1 Answer


kde allows multidimensional kernel estimates, so we can use kde itself to calculate pkde.
To do so, we evaluate kde on a grid of sufficiently small dx and dy steps via the eval.points parameter: this gives us the local density estimate on each dx*dy square.
We verify that the sum of the estimates multiplied by the area of the squares is almost equal to 1:

library(ks)
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
xy <- cbind(x, y)

# evaluation grid: small dx/dy steps over a range wide enough to cover the data
xmin <- -10
xmax <- 10
dx <- .1

ymin <- -10
ymax <- 10
dy <- .1

pts.x <- seq(xmin, xmax, dx)
pts.y <- seq(ymin, ymax, dy)
pts <- as.data.frame(expand.grid(x = pts.x, y = pts.y))

# evaluate the bivariate KDE at every grid point
f_kde <- kde(xy, eval.points = pts)

pts$est <- f_kde$estimate

# the estimates times the cell area should sum to (almost) 1
sum(pts$est) * dx * dy
[1] 0.9998778

You can now query the pts data frame for the cumulative probability over the area of your choice:

library(data.table)
setDT(pts)

# cumulative density: P(X < 1, Y < 2)
pts[x < 1 & y < 2, .(pkde = sum(est) * dx * dy)]
        pkde
1: 0.7951228

# average density around the point (1, 2)
tolerance <- .1
pts[abs(x - 1) < tolerance & abs(y - 2) < tolerance, .(kde = mean(est))]
          kde
1: 0.01465478
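
If you need the cumulative probability at every observed point, i.e. a bivariate analogue of pkde(x, f_kde), one option is a two-dimensional cumulative sum over the same grid followed by a cell lookup for each observation. The following is only a sketch building on the grid above, assuming all observations fall inside the raster; the names est_mat, cdf_mat, ix, iy and p_kde are introduced here for illustration:

# reshape the estimates into a matrix (rows = pts.x, columns = pts.y);
# expand.grid varies x fastest, so column-wise filling matches the grid order
est_mat <- matrix(pts$est, nrow = length(pts.x), ncol = length(pts.y))

# two-dimensional cumulative sum, scaled by the cell area: cdf_mat[i, j]
# approximates P(X <= pts.x[i], Y <= pts.y[j])
cdf_mat <- t(apply(apply(est_mat, 2, cumsum), 1, cumsum)) * dx * dy

# index of the largest grid point below each observation
ix <- findInterval(x, pts.x)
iy <- findInterval(y, pts.y)

# approximate cumulative probability for every (x[i], y[i]) pair
p_kde <- cdf_mat[cbind(ix, iy)]
head(p_kde)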
  • Thank you for your answer. I have two questions: 1. I tested the code with my own data (`n = 69`) and the sum of p is only about 0.9, should I worry about it? 2. I have trouble converting the last line for my purpose, could you show how to estimate the probability density of `xy[1,]` ( = `x[1]` and `y[1]`), thank you! – Felix Phl Jul 21 '20 at 09:43
  • Regarding question 1) kde on only 69 points for a 100*100 raster is still very imprecise, so I'm not surprised you get a sum about 0.9. For 2) pkde is the cumulative density, that's why I made a sum of two inequalities. If you want the density on one point you can use `pts[ x==1 & y==1] ` provided that the point you're asking for is in the raster. – Waldi Jul 21 '20 at 11:06
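
As a small illustration of that last comment, for a point that is not exactly on the raster (an illustrative addition reusing the pts table from above, not part of the original comments), you can pick the grid cell nearest to, say, the first observation xy[1, ]:

# density estimate at the grid cell nearest to (x[1], y[1])
pts[which.min((x - xy[1, 1])^2 + (y - xy[1, 2])^2)]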