
I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely boxplots, histograms, and density plots. I would like to do this by writing a single function that is applied across all attributes (columns), producing one boxplot (or histogram, etc.) per column and storing each plot as an element of a list, so that I can later index the list by number (or by column name) to retrieve the plot for a given attribute.

The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like

bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) + 
  geom_boxplot() + 
  labs(title="Plot of length per dose", x="Group", y =paste(Attr)) + 
  theme_classic()

the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).

(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)

Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!

Edit: I'm after something like what the top answer to this question achieves: apply box plots to multiple variables. Except that with the code in that answer you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. The author also uses ```expand.grid``` to include different ```x``` possibilities (I have only one, the Group factor), but it would be easy to simplify if the looping problem could be handled.

I'd also prefer just base R if possible--dplyr if absolutely necessary.

Mobeus Zoom
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick May 05 '20 at 15:47

1 Answer


Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label:

library(tidyverse)

plots <- 
  imap(select(mtcars, -cyl), ~ {
    ggplot(mtcars, aes(x = cyl, y = .x)) + 
      geom_point() +
      ylab(.y)
  })

plots$mpg


You can also do this without purrr and dplyr:

to_plot <- setdiff(names(mtcars), 'cyl')

plots <- 
  Map(function(.x, .y) {
    ggplot(mtcars, aes(x = cyl, y = .x)) + 
      geom_point() +
      ylab(.y)
  }, mtcars[to_plot], to_plot)

plots$mpg
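
A possible variation, closer to the boxplot in the question: iterate over the column names rather than the columns, and look each column up inside `aes()` with ggplot2's `.data` pronoun. This is only a sketch. The helper name `box_plot` and its arguments are made up for illustration, and `mtcars` with `cyl` stands in for the asker's `df` and `Group`.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): one boxplot of column `col`,
# split by the grouping column `group`. Both are passed as strings.
box_plot <- function(col, data, group) {
  ggplot(data, aes(x = factor(.data[[group]]),
                   y = .data[[col]],
                   fill = factor(.data[[group]]))) +
    geom_boxplot() +
    labs(title = paste("Plot of", col, "per", group),
         x = group, y = col, fill = group) +
    theme_classic()
}

# Every column except the grouping column
to_plot <- setdiff(names(mtcars), "cyl")

# lapply over the names, then name the list so it can be indexed by column name
plots <- setNames(lapply(to_plot, box_plot, data = mtcars, group = "cyl"), to_plot)

plots$mpg    # look up by column name
plots[[1]]   # or by position
```

Because the function receives the column name instead of the column vector, the name is available for the title and axis label, which is the piece that `apply(df, 2, boxPlot)` loses.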
IceCreamToucan
  • Thanks for the idea. I'd prefer not to use ```purrr```, though, or ideally anything but base R (in conjunction with ```ggplot2```, obviously). – Mobeus Zoom May 05 '20 at 17:23
  • Thank you!! This is a good solution. Is there a straightforward way to extend it for density plots which compare two different datasets (over an identically named attribute)? i.e. ```ggplot() + geom_density(aes(x=Attr), fill="red", data=vec_from_dataset1, alpha=.5) + geom_density(aes(x=Attr), fill="blue", data=vec_from_dataset2, alpha=.5)``` where I could supply the names of ```dataset1``` and ```dataset2``` and they'd be indexed for attribute ```Attr``` -- and then this mapping be done to create list of the density plots over all (always identical-named) attributes, for the two dataframes? – Mobeus Zoom May 14 '20 at 23:09
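
One way the density comparison asked about in the last comment might be sketched, using the same names-based approach. The helper `density_plot` is hypothetical, and the two subsets of `mtcars` are stand-ins for `dataset1` and `dataset2`, which are assumed to share column names.

```r
library(ggplot2)

# Hypothetical helper: overlay the density of column `col` from two data frames.
# Each layer gets its own `data`, and `.data[[col]]` refers to that layer's data.
density_plot <- function(col, data1, data2) {
  ggplot() +
    geom_density(aes(x = .data[[col]]), data = data1, fill = "red", alpha = 0.5) +
    geom_density(aes(x = .data[[col]]), data = data2, fill = "blue", alpha = 0.5) +
    xlab(col)
}

# Stand-in data (assumption): two subsets of mtcars with identical column names
dataset1 <- mtcars[mtcars$am == 0, ]
dataset2 <- mtcars[mtcars$am == 1, ]

# Columns present in both data frames, minus the column used to split them
shared <- setdiff(intersect(names(dataset1), names(dataset2)), "am")

density_plots <- setNames(
  lapply(shared, density_plot, data1 = dataset1, data2 = dataset2),
  shared
)

density_plots$mpg
```

Using `intersect()` on the names keeps the list limited to attributes that exist in both data frames.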