I have scatterplots of 2D data from two categories. I want to add density lines for each dimension -- not outside the plot (cf. Scatterplot with marginal histograms in ggplot2) but right on the plotting surface. I can get this for the x-axis dimension, like this:

library(ggplot2)

set.seed(123)
dim1 <- c(rnorm(100, mean=1), rnorm(100, mean=4))
dim2 <- rnorm(200, mean=1)
cat <- factor(c(rep("a", 100), rep("b", 100)))
mydf <- data.frame(cbind(dim2, dim1, cat))  # note: cbind() stores the factor as its integer codes 1 and 2
ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) + 
  geom_point() +
  # per-category density of dim1, scaled to a maximum of 1 and shifted down to y = -2
  stat_density(aes(x=dim1, y=(-2+(..scaled..))), 
               position="identity", geom="line")

It looks like this:

[image: scatterplot with one density curve per category drawn along the x-axis]

But I want an analogous pair of density curves running vertically, showing the distribution of points in the y-dimension. I tried

stat_density(aes(y=dim2, x=0+(..scaled..)), position="identity", geom="line")

but I receive the error "stat_density requires the following missing aesthetics: x".

Any ideas? Thanks.

D Swingley
  • I added your plot (pending review). This looks like a tough one. I wonder if `coord_flip` is useful here – C8H10N4O2 Jul 01 '15 at 18:39
  • This is interesting, although it's not what you're looking for: `ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) + stat_density2d()` – C8H10N4O2 Jul 01 '15 at 19:33

3 Answers

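You can compute the densities of dim2 for each category with density(). Then swap the axes (the density values become x, the points where the density was evaluated become y) and store the result in a new data.frame. After that, it is simply a matter of plotting the curves on top of the existing graph.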

p <- ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) + 
  geom_point() +
  stat_density(aes(x=dim1, y=(-2+(..scaled..))), 
               position="identity", geom="line")

stuff <- ggplot_build(p)
xrange <- stuff[[2]]$ranges[[1]]$x.range  # extract the x range, to make the new densities align with y-axis
# (in ggplot2 >= 3.0 the same range is stored at stuff$layout$panel_params[[1]]$x.range)

## Get densities of dim2
ds <- do.call(rbind, lapply(unique(mydf$cat), function(lev) {
    dens <- with(mydf, density(dim2[cat==lev]))
    data.frame(x=dens$y+xrange[1], y=dens$x, cat=lev)
}))

p + geom_path(data=ds, aes(x=x, y=y, color=factor(cat)))

[image: the scatterplot with the horizontal density curves plus the new vertical density curves along the y-axis]
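
One detail worth noting: the horizontal curves use `..scaled..`, which rescales each density so that its peak is 1, whereas `density()` returns unscaled values. A minimal sketch of a rescaled variant follows; it is not part of the answer's code. The name `ds_scaled` and the offset `x0` are assumptions made for illustration (`x0` stands in for `xrange[1]` so that the block runs without building the plot first).

```r
# Rebuild the data exactly as in the question (cat ends up as integer codes 1 and 2).
set.seed(123)
dim1 <- c(rnorm(100, mean = 1), rnorm(100, mean = 4))
dim2 <- rnorm(200, mean = 1)
cat <- factor(c(rep("a", 100), rep("b", 100)))
mydf <- data.frame(cbind(dim2, dim1, cat))

# Hypothetical left-hand offset; the answer uses xrange[1] from ggplot_build() instead.
x0 <- min(mydf$dim1)

# One flipped density per category, rescaled so that each curve peaks at x0 + 1,
# mirroring what ..scaled.. does for the horizontal curves.
ds_scaled <- do.call(rbind, lapply(unique(mydf$cat), function(lev) {
  dens <- density(mydf$dim2[mydf$cat == lev])
  data.frame(x = x0 + dens$y / max(dens$y), y = dens$x, cat = lev)
}))
```

With `x0` replaced by `xrange[1]`, the result can be drawn the same way as `ds`, for example `p + geom_path(data=ds_scaled, aes(x=x, y=y, color=factor(cat)))`.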

Rorschach
  • Very nice solution. Kicking myself for not using `density` – C8H10N4O2 Jul 01 '15 at 19:36
  • thanks, it looks like ggplot might scale the y-values of the density a bit? Your curves after the coord_flip look a little bit "taller". – Rorschach Jul 01 '15 at 19:47
  • If the original dataset had a factor (4-way, let's say) that I was using to facet_grid, any thoughts on how to modify the density extraction do.call to pull that data out? – D Swingley Jul 02 '15 at 15:58
  • @DSwingley You might need a nested `lapply` to get densities, one for each facet and one for each category in the facets. – Rorschach Jul 02 '15 at 16:47
  • Sounds reasonable, thanks. I'm struggling to do the lapply part, will post if I succeed. So far errors with: mydf$mytype <- factor(rep(c("I", "J", "K", "L"),50)) and then ds3 <- do.call(rbind, lapply(unique(mydf$mytype), (lapply(unique(mydf$cat), function(lev) { dens <- with(mydf, density(dim2[cat==lev])) data.frame(x=dens$y+xrange[1], y=dens$x, cat=lev) })))) – D Swingley Jul 02 '15 at 20:48
  • @DSwingley post a new question if you want, the code will get too long for the comments I think. – Rorschach Jul 02 '15 at 20:54
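
For the faceting case raised in the comments above, here is one possible sketch of a density per facet and category. It is not the answerer's code: it uses `split()` on both variables in place of a nested `lapply`, and `ds3`, `cells`, and the offset `x0` (a stand-in for `xrange[1]`) are assumptions made for illustration.

```r
# Rebuild the data as in the question and add the four-level facet variable
# from the comment above.
set.seed(123)
dim1 <- c(rnorm(100, mean = 1), rnorm(100, mean = 4))
dim2 <- rnorm(200, mean = 1)
cat <- factor(c(rep("a", 100), rep("b", 100)))
mydf <- data.frame(cbind(dim2, dim1, cat))
mydf$mytype <- factor(rep(c("I", "J", "K", "L"), 50))

# Hypothetical left-hand offset; the answer uses xrange[1] from ggplot_build() instead.
x0 <- min(mydf$dim1)

# split() yields one data.frame per non-empty (mytype, cat) combination;
# each one is turned into a flipped density curve that keeps both labels.
cells <- split(mydf, list(mydf$mytype, mydf$cat), drop = TRUE)
ds3 <- do.call(rbind, lapply(cells, function(cell) {
  dens <- density(cell$dim2)
  data.frame(x = dens$y + x0, y = dens$x,
             cat = cell$cat[1], mytype = cell$mytype[1])
}))
```

Because `ds3` carries a `mytype` column, it should work with `facet_grid(. ~ mytype)` once the scatterplot itself is rebuilt from the extended `mydf`, with the curves added through `geom_path(data=ds3, aes(x=x, y=y, color=factor(cat)))`.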

So far I can produce:

distrib_horiz <- stat_density(aes(x=dim1, y=(-2+(..scaled..))), 
                              position="identity", geom="line")

ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) + 
  geom_point() + distrib_horiz

[image: scatterplot with the horizontal density curves]

And:

distrib_vert <- stat_density(data=mydf, aes(x=dim2, y=(-2+(..scaled..))), 
                             position="identity", geom="line") 

ggplot(data=mydf, aes(x=dim2, y=dim1, colour=as.factor(cat))) + 
  geom_point() + distrib_vert + coord_flip()

[image: flipped scatterplot with the dim2 density curves]

But combining them is proving tricky.

C8H10N4O2
  • retaining stat_density in a data structure seems like a nice start (if one could draw a line from it, ggplot wouldn't have to know where the x,y's came from). But I'm not sure how to access the density data. – D Swingley Jul 01 '15 at 19:15

So far I have only a partial solution: I did not manage to obtain a vertical stat_density line for each individual category, only for the whole data set. It may nevertheless help as a starting point for a better solution. My suggestion is to try the ggMarginal() function from the ggExtra package.

library(ggplot2)
library(ggExtra)

p <- ggplot(data=mydf, aes(x=dim1, y=dim2, colour=as.factor(cat))) + 
  geom_point() +
  stat_density(aes(x=dim1, y=(-2+(..scaled..))), 
               position="identity", geom="line")

# add a single marginal density for the y dimension (all categories pooled)
ggMarginal(p, type = "density", margins = "y", size = 4)

This is what I obtain: [image: the scatterplot with a single marginal density curve for the y dimension]

I know it's not perfect, but maybe it's a step in a helpful direction. At least I hope so. Looking forward to seeing other answers.

RHertel