2

I have a vector of means and standard deviations, and I would like to plot the densities corresponding to these means and standard deviations in the same plot using ggplot2. I used mapply and gather to solve this problem, but it's quite a lot of lines of code for something which I think should be trivial:

library(dplyr)
library(tidyr)
library(ggplot2)

# generate data
my_data <- data.frame(mean =  c(0.032, 0.04, 0.038, 0.113, 0.105, 0.111),
                      stdev = c(0.009, 0.01, 0.01, 0.005, 0.014, 0.006), 
                      test = factor(c("Case_01", "Case_02", "Case_03", "Case_04",
                                      "Case_05", "Case_06")))

# points at which to evaluate the Gaussian densities
x <- seq(-0.05, 0.2, by = 0.001)

# build list of Gaussian density vectors based on means and standard deviations
pdfs <- mapply(dnorm, mean = my_data$mean, sd = my_data$stdev, MoreArgs = list(x = x),
               SIMPLIFY = FALSE)

# add group names
names(pdfs) <- my_data$test

# convert list to dataframe
pdfs <- do.call(cbind.data.frame, pdfs)
pdfs$x <- x

# convert dataframe to tall format
tall_df <- gather(pdfs, test, density, -x)

# build plot
p <- ggplot(tall_df, aes(color = test, x = x, y = density)) +
  geom_line() +
  geom_segment(data = my_data, aes(color = test, x = mean, y = 0, 
                                   xend = mean, yend = 100), linetype = "dashed") +
  coord_cartesian(ylim = c(-1, 100))
print(p)

enter image description hereThis is very similar to:

Plot multiple normal curves in same plot

and as a matter of fact, the accepted answer uses mapply, so that confirms me that I'm on the right track. However, what I don't like of that answer is that it hard-codes means and standard deviations in the mapply call. This won't work in my use case, because I read the real data from disk (of course, in the MRE I skipped the data reading part for simplicity). Is it possible to simplify my code, without sacrificing readability, and without hard-coding the mean and standard deviation vectors in the mapply call?

EDIT maybe the call to mapply may be avoided by using the package mvtnorm, but I don't think that affords any real simplification here. Most of my code comes after the call to mapply.

DeltaIV
  • 4,773
  • 12
  • 39
  • 86

1 Answers1

3

You can save some coding using purrr::pmap_df, which does the row binding automatically after constructing a data frame for each mean-stdev pair:

Assume my_data has the input columns in the order or mean, stdev, test and test is of character class.

library(purrr)
tall_df2 <- pmap_df(my_data, ~ data_frame(x = x, test = ..3, density = dnorm(x, ..1, ..2)))

With data:

my_data <- data.frame(mean =  c(0.032, 0.04, 0.038, 0.113, 0.105, 0.111),
                      stdev = c(0.009, 0.01, 0.01, 0.005, 0.014, 0.006), 
                      test = c("Case_01", "Case_02", "Case_03", "Case_04", "Case_05", "Case_06"), 
                      stringsAsFactors = F)

Plot:

p <- ggplot(tall_df2, aes(color = factor(test), x = x, y = density)) + 
      geom_line() +
      geom_segment(data = my_data, aes(color = test, x = mean, y = 0, 
                                       xend = mean, yend = 100), linetype = "dashed") +
      coord_cartesian(ylim = c(-1, 100))

print(p)

gives:

enter image description here

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • cool! Could you explain what's the goal of `pmap_df`? I'm not familiar with `purrr` – DeltaIV Nov 09 '17 at 15:07
  • 1
    Also, I'm not sure why you had to use the `..i` sintax. For sure, `tall_df2 <- pmap_df(my_data, ~ data_frame(x = x, test = test, density = dnorm(x, mean, stdev)))` doesn't work, thus your syntax is the right one. But I don't understand why – DeltaIV Nov 09 '17 at 15:17
  • 2
    `pmap_df` is sort of a combination of `pmap` + `bind_rows`, and `pmap` does similar things to `mapply` but with different syntax. So instead of doing `mapply(f, a, b, c)` you do `pmap(list(a, b, c),f)`, this is convenient if you already have the data together, in your case a data frame. And finally it calls `bind_rows` for you to save you the `do.call(..., rbind)`. For more about syntax, see `?pmap`. If you want to match arguments by name, you need to construct a function explicitly `pmap_df(my_data, function(mean, stdev, test) data_frame(x = x, test = test, density = dnorm(x, mean, stdev)))` – Psidom Nov 09 '17 at 15:29
  • 1
    amazing! I will use the match by name approach - more readable, IMO, and also doesn't make assumptions about the order of columns. Since in the real use case I read data from disk, I cannot be sure about the column order - but I know the column names. – DeltaIV Nov 09 '17 at 15:32
  • The `x` argument should probably be replaced by the sequence in the question (`seq(-0.05, 0.2, by = 0.001)`) for clarity. – Mikko Sep 15 '21 at 08:07