Is there a general way to draw densities (violin plots) or histograms showing the distribution of x along a smooth (x,y) curve? I use this approach to show the marginal distribution of x when there are multiple groups (e.g., different curves on one panel, delineated by differing colors).

Here is an example using the Hmisc package's plsmo function to get stratified loess curves and spike histograms showing the sex-specific data density for age.

require(Hmisc)
set.seed(1)
age <- rnorm(500, 50, 15)
y <- sample(0:1, 500, TRUE)
sex <- sample(c('female','male'), 500, TRUE)
plsmo(age, y, group=sex, col=1:2,
      datadensity=TRUE, scat1d.opts=list(nhistSpike=20))

[Figure: stratified loess curves of y against age, one per sex, with spike histograms of age drawn along each curve]

Frank Harrell
  • How are you specifying the smooth curve? It would help to have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to see what your input looks like. – MrFlick Dec 25 '14 at 20:44
  • I'm having trouble understanding what plsmo is estimating and plotting. I would have imagined that you were describing a 1-d density: `densityplot(~age, groups=sex, data=dat)` for which the ggplot2 counterpart would be: `p <- ggplot( data=dat, aes( x=y, y=age, group=sex))+geom_violin(); print(p)` – IRTFM Dec 26 '14 at 01:06
  • `plsmo` estimates the relationship between x and y using `lowess()`, then computes the elements of a high-resolution histogram for the distribution of `x` conditional on the grouping variable and projects that histogram onto the `lowess` curve(s). – Frank Harrell Dec 26 '14 at 03:17
  • I doubt you will be able to achieve anything close to this without writing your own custom function. I guess you could modify `plsmo` to use `ggplot`. `stat_smooth()` already does the loess part; all that is left is to add the histogram, just as you did in the `plsmo` function. – David Arenburg Dec 26 '14 at 09:48
  • Yes, I have a new function that creates a layer to add to `ggplot()`; see https://github.com/harrelfe/rms/blob/master/R/ggplot.Predict.s. But this function has to be given redundant information already known to the `ggplot` object, and it takes the already-smoothed data instead of the raw data. I've also created a new `geom`, `geom_plsmo`, to use the exceptionally fast `lowess()`, but `geom_plsmo` does not add the histogram to the curves. – Frank Harrell Dec 26 '14 at 12:52
  • I have continued to enhance my function that calls `geom_segment`. It is at https://github.com/harrelfe/Hmisc/blob/master/R/histSpikeg.s. It is fully functional but does require passing some redundant information as arguments as it's not a real `geom`. – Frank Harrell Dec 29 '14 at 01:32
  • I can do it, but it is simpler to show using boxplot. May I show how to put boxplots along your curve? Vioplot uses boxplot-like notation. – EngrStudent Jan 22 '15 at 15:43
  • I really want the entire data distribution. Box plots do not capture isolated points or bimodality. Thanks. I've got `histSpikeg` fine-tuned now. – Frank Harrell Jan 22 '15 at 21:53
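
The approach described in the comments above (smooth with `lowess()`, bin `x` within each group, then draw each bin as a short `geom_segment` rising from the curve) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of `plsmo` or `histSpikeg`: the helper `spike_segments()` and its arguments `nbins` and `max_height` are hypothetical names, the data are simulated as in the question, and spike heights are scaled within each group rather than across groups.

```r
# Sketch: spike histograms of x drawn along per-group lowess curves.
# spike_segments() is a hypothetical helper, not part of Hmisc or ggplot2.
set.seed(1)
dat <- data.frame(
  age = rnorm(500, 50, 15),
  y   = sample(0:1, 500, TRUE),
  sex = sample(c("female", "male"), 500, TRUE)
)

spike_segments <- function(x, y, nbins = 20, max_height = 0.05) {
  fit  <- lowess(x, y)                                # smooth curve, sorted by x
  brks <- seq(min(x), max(x), length.out = nbins + 1) # equal-width bins over x
  mids <- (brks[-1] + brks[-length(brks)]) / 2
  counts <- as.vector(table(cut(x, brks, include.lowest = TRUE)))
  # Height of the smooth curve at each bin midpoint
  yhat <- approx(fit$x, fit$y, xout = mids, ties = mean)$y
  keep <- counts > 0                                  # no spike for empty bins
  data.frame(
    x    = mids[keep],
    y    = yhat[keep],
    yend = yhat[keep] + max_height * counts[keep] / max(counts)
  )
}

by_sex <- split(dat, dat$sex)

# One set of spikes and one smooth curve per group
segs <- do.call(rbind, lapply(by_sex, function(d) {
  data.frame(sex = d$sex[1], spike_segments(d$age, d$y))
}))
curves <- do.call(rbind, lapply(by_sex, function(d) {
  fit <- lowess(d$age, d$y)
  data.frame(sex = d$sex[1], age = fit$x, y = fit$y)
}))

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(curves, aes(age, y, colour = sex)) +
    geom_line() +
    geom_segment(data = segs, aes(x = x, xend = x, y = y, yend = yend))
  print(p)
}
```

Because the spikes are computed outside `ggplot`, the smoothing is done twice (once for the curve, once for the spike baselines), which is the kind of redundancy a true `geom` would avoid.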

1 Answer

I believe you can do this with the ggsubplot package. See the article and the package. I believe the code will look something like:

qplot(age, y, data = dataset, color = sex) + 
    geom_subplot(aes(x, y, data = distributions, group = sex, 
        subplot = geom_violin(aes(x, y, data = distributions))))

But I don't think your example provides enough detail to create the violins at points along the curves, unless I have misunderstood your question.

joeyreid
  • Thanks for the pointer to the excellent article, which I read with interest. I haven't yet been able to figure out whether subplots will allow me to coordinate point-by-point with the main layer, which is needed to add things like spike histograms along existing plotted curves. I note that the article failed to reference Daniel Carr's work or thermometer plots. – Frank Harrell Feb 09 '15 at 17:10