
I have a map done in R with stat_density2d. This is the code:

library(ggplot2)
library(RColorBrewer)  # for brewer.pal()

ggplot(data, aes(x = Lon, y = Lat)) +
  stat_density2d(aes(fill = ..level..), alpha = 0.5, geom = "polygon", show.legend = FALSE) +
  geom_point(colour = "red") +
  geom_path(data = map.df, aes(x = long, y = lat, group = group), colour = "grey50") +
  scale_fill_gradientn(colours = rev(brewer.pal(7, "Spectral"))) +
  xlim(-10, 2.5) +
  ylim(47, 60) +
  coord_fixed(1.7) +
  theme_void()

And it produces this:

UK Map

Great. It works. However, I do not know what the legend means. I did find this Wikipedia page:

https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation

And the example they used (which contains red, orange and yellow) stated:

The coloured contours correspond to the smallest region which contains the respective probability mass: red = 25%, orange + red = 50%, yellow + orange + red = 75%

However, using stat_density2d, I have 11 contours in my map. Does anyone know how stat_density2d works and what the legend means? Ideally I want to be able to state something like "the red contour contains 25% of the points", etc.

I have read this: https://ggplot2.tidyverse.org/reference/geom_density_2d.html and I am still none the wiser.

  • https://stackoverflow.com/questions/32206623/what-does-level-mean-in-ggplotstat-density2d – hrbrmstr Nov 06 '18 at 13:26
  • Thank you. I will digest the link :) – Nicholas Nov 06 '18 at 13:55
  • I saw that you posted there, but I am still unsure how I am meant to interpret the results, i.e. what does each level mean? I ask because someone is asking me what each contour means and I cannot explain. Is it as simple as "if there are 10 levels, each level is 10%", so the data marked red is the densest 10%, the next level the densest 20%, etc.? – Nicholas Nov 06 '18 at 14:57

1 Answer


Let's take the faithful example from ggplot2:

library(ggplot2)

ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(fill = factor(stat(level))), geom = "polygon") +
  geom_point() +
  xlim(0.5, 6) +
  ylim(40, 110)

(apologies in advance for not making this prettier)

[plot: filled density contour polygons of the faithful data with the points overlaid]

The level is the height at which the 3D "mountains" were sliced. I don't know of a way (others might) to translate that height directly into a percentage, but I do know how to compute the percentages you're after.
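If you want to see where those legend numbers come from, here is a minimal sketch, assuming ggplot2's defaults (a MASS::kde2d() density with MASS::bandwidth.nrd() bandwidths and roughly pretty() breaks over the surface), which is also the recipe used later in this answer:

# minimal sketch (assumes ggplot2's defaults): reproduce the legend values
# for the faithful plot above
library(MASS)

h    <- c(bandwidth.nrd(faithful$eruptions), bandwidth.nrd(faithful$waiting))
dens <- kde2d(faithful$eruptions, faithful$waiting, h = h, n = 100)

pretty(range(dens$z), 10)
# values like 0.002, 0.004, ... -- the density "heights" at which the surface
# is sliced, i.e. the numbers shown in the legend (densities, not probabilities)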

If we look at that chart, the 0.002 level contains the vast majority of the points (all but 2). The 0.004 level is actually two polygons, and together they contain all but about a dozen of the points. If I'm getting the gist of what you're asking, that's what you want to know, only expressed not as a count but as the percentage of points encompassed by the polygons at a given level. That's straightforward to compute using the methodology from the various ggplot2 "stats" involved.

Note that while we're importing the tidyverse and sp packages, we'll call some other functions fully qualified. Now, let's reshape the faithful data a bit:

library(tidyverse)
library(sp)

xdf <- select(faithful, x = eruptions, y = waiting)

(easier to type x and y)

Now, we'll compute the two-dimensional kernel density estimation the way ggplot2 does:

# bandwidths ggplot2 uses by default
h <- c(MASS::bandwidth.nrd(xdf$x), MASS::bandwidth.nrd(xdf$y))

# 2D kernel density estimate on a 100 x 100 grid over the plot limits
dens <- MASS::kde2d(
  xdf$x, xdf$y, h = h, n = 100,
  lims = c(0.5, 6, 40, 110)
)

# long-format version of the density surface
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))

# contour levels, chosen the same way ggplot2 picks them
breaks <- pretty(range(zdf$z), 10)

# the contour lines at those levels
cl <- grDevices::contourLines(
  x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
  levels = breaks
)

I won't clutter the answer with str() output but it's kinda fun looking at what happens there.
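If you do want to poke around yourself, a couple of hypothetical str() calls (output omitted here) show the shape of the pieces:

str(dens, max.level = 1)  # the kde grid: x, y, and a 100 x 100 z matrix of densities
str(cl[[1]])              # one contour line: its level plus the x/y coordinates of its vertices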

We can use spatial ops to figure out how many points fall within given polygons, then we can group the polygons at the same level to provide counts and percentages per-level:

# build one polygon per contour ring, keeping the ring order so that
# polygon idx matches cl[[idx]]
SpatialPolygons(
  lapply(seq_along(cl), function(idx) {
    Polygons(
      srl = list(Polygon(
        matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow = length(cl[[idx]]$x), byrow = FALSE)
      )),
      ID = idx
    )
  })
) -> cont

# promote the points to a SpatialPoints object so we can do the overlay
coordinates(xdf) <- ~x+y

# count the points inside each contour polygon, then sum the counts per level
tibble(
  ct  = sapply(over(cont, geometry(xdf), returnList = TRUE), length),
  lvl = sapply(cl, function(x) x$level)
) %>% 
  count(lvl, wt = ct) %>% 
  mutate(
    pct = n / length(xdf),
    pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
  )
## # A tibble: 12 x 4
##      lvl     n    pct pct_lab                              
##    <dbl> <int>  <dbl> <chr>                                
##  1 0.002   270 0.993  99.3% of the points fall within this level
##  2 0.004   259 0.952  95.2% of the points fall within this level
##  3 0.006   249 0.915  91.5% of the points fall within this level
##  4 0.008   232 0.853  85.3% of the points fall within this level
##  5 0.01    206 0.757  75.7% of the points fall within this level
##  6 0.012   175 0.643  64.3% of the points fall within this level
##  7 0.014   145 0.533  53.3% of the points fall within this level
##  8 0.016    94 0.346  34.6% of the points fall within this level
##  9 0.018    81 0.298  29.8% of the points fall within this level
## 10 0.02     60 0.221  22.1% of the points fall within this level
## 11 0.022    43 0.158  15.8% of the points fall within this level
## 12 0.024    13 0.0478  4.8% of the points fall within this level 

I've only spelled it out this way to avoid blathering on further, but note that the percentages will change depending on how you modify the various parameters of the density computation (the same holds true for my ggalt::geom_bkde2d(), which uses a different estimator).
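For example (a hypothetical tweak of my own, not part of the computation above), halving the bandwidths gives a spikier surface, so pretty() picks different breaks and the per-level percentages shift accordingly:

# hypothetical: re-run the density with half the bandwidth and compare the breaks;
# the original faithful columns are used here because xdf is now a SpatialPoints object
dens2 <- MASS::kde2d(
  faithful$eruptions, faithful$waiting,
  h = h / 2, n = 100, lims = c(0.5, 6, 40, 110)
)
pretty(range(dens2$z), 10)   # different levels than before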

If there is a way to tease out the percentages without re-performing the calculations there's no better way to have that pointed out than by letting other SO R folks show how much more clever they are than the person writing this answer (hopefully in more diplomatic ways than seem to be the mode of late).

  • Hey. I cannot put into words how helpful your response is. That is the answer I wanted and more! Thank you for the time you have taken to respond. As soon as I'm back at work tomorrow I'll run the additional code so I have a response for them; they will be really chuffed! – Nicholas Nov 06 '18 at 18:17
  • Glad to assist, and I may PR into ggplot2 to add this as an additional computed stat that can be used for the legend instead of the level, as I, too, think it's more useful. – hrbrmstr Nov 06 '18 at 18:22
  • Hey, the code runs but I'm not getting the result I wanted, unfortunately: it is only showing 9 levels instead of 12, and 5 of the levels have no data. Is there something I have to change to make it fit my data? I am using longitude and latitude, and my 'xdf' contains 664 observations of those lon/lat. – Nicholas Nov 07 '18 at 11:25
  • This is my 'xdf' data if you wanted a look: https://1drv.ms/u/s!AoBBWjKJEd_Eg9BqSK0gAkc0_3uCAQ – Nicholas Nov 07 '18 at 11:26
  • And do not worry if you do not have time to look. I can try and work it out myself :) – Nicholas Nov 07 '18 at 11:27
  • #ty! I rarely get to work with spatial data in the $DAYJOB (I work in cybersecurity), so any time I get to play with real-world spatial data is a treat! – hrbrmstr Nov 07 '18 at 12:06
  • ahhh great, thank you! :).... I'm slowly starting to enjoy spatial data – Nicholas Nov 07 '18 at 13:02