I often have to make plots that are essentially the same, differing only in the variables and/or data frames used:

    p <- ggplot(data = data1, aes(x = variable1, y = ..density..)) +
      geom_histogram(bins = 15, alpha = 0.2, position = "identity", aes(fill = groupvar1)) +
      geom_density(size = 1, aes(color = groupvar1))

    p <- ggplot(data = data1, aes(x = variable2, y = ..density..)) +
      geom_histogram(bins = 15, alpha = 0.2, position = "identity", aes(fill = group2)) +
      geom_density(size = 1, aes(color = group2))

    p <- ggplot(data = data2, aes(x = variable3, y = ..density..)) +
      geom_histogram(bins = 15, alpha = 0.2, position = "identity", aes(fill = group3)) +
      geom_density(size = 1, aes(color = group3))

    ...

and so on. Rather than duplicating nearly identical code multiple times, I would like to write a single function that I can use with different data frames and different variables to be plotted, with or without a grouping variable. Something like:

    library(ggplot2)

    # Toy data: two numeric variables and two grouping variables
    my_data <- data.frame(y = rnorm(100, 0, 1), z = runif(100, 0, 1),
                          group1 = rep(c("A", "B"), each = 50),
                          group2 = as.factor(rep(1:4, each = 25)))

    variable_distribution <- function(dataframe, myvar, groupvar = NULL) {
      p <- ggplot(data = dataframe, aes(x = myvar, y = ..density..))
      if (is.null(groupvar)) {
        # No grouping variable: a single histogram and density curve
        p <- p + geom_histogram(bins = 15, alpha = 0.2, position = "identity") +
          geom_density(size = 1)
      } else {
        # Grouping variable: one histogram and density curve per group
        p <- p + geom_histogram(bins = 15, alpha = 0.2, position = "identity", aes(fill = groupvar)) +
          geom_density(size = 1, aes(color = groupvar))
      }
      print(p)
    }

Some results:

    variable_distribution(my_data, my_data$y, my_data$group1)
    variable_distribution(my_data, my_data$z, my_data$group2)

There are several issues with my code:
- The labels are not what I would like them to be. In the first call, I would like the x-label to be `y` instead of `myvar`, and the legend title to be `group1` instead of `groupvar`. In the second call, the x-label should be `z` and the legend title `group2`.
- `y` and `group1` are part of `my_data`, so it seems redundant to pass them as two vectors "separated" from `my_data`.
PS I don't want to address the variables by column number, because that makes the code much less readable. I'd like an interface such as

    variable_distribution(my_data, y, group1)

or

    variable_distribution("my_data", "y", "group1")

Or something like that...
EDIT: the solution in the linked questions just doesn't work, as someone might have noticed if they had actually tried to answer the question instead of concentrating on which question this one should be a duplicate of. Look:

    variable_distribution <- function(dataframe, x_string, group_string = NULL) {
      p <- ggplot(data = dataframe, aes_string(x = x_string, y = ..density..))
      if (is.null(group_string)) {
        p <- p + geom_histogram(bins = 15, alpha = 0.2, position = "identity") +
          geom_density(size = 1)
      } else {
        p <- p + geom_histogram(bins = 15, alpha = 0.2, position = "identity", aes(fill = group_string)) +
          geom_density(size = 1, aes(color = group_string))
      }
      print(p)
    }

    variable_distribution(my_data, "y", "group1")

    Error in aes_string(x = x_string, y = ..density..) :
      object '..density..' not found
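
The error above arises because `..density..` is passed unquoted to `aes_string()`, so R tries to evaluate it as an ordinary object before ggplot2 ever sees it. For illustration only, here is a minimal sketch of a string-based interface that sidesteps this. It assumes ggplot2 >= 3.4.0 (for `after_stat()`, the `.data` pronoun inside `aes()`, and the `linewidth` argument), and the name `variable_distribution_sketch` is hypothetical, not something from the linked questions. The labels are set explicitly with `labs()` so that they show the column names.

```r
library(ggplot2)

# Hypothetical helper (an assumption for illustration): columns are given as strings.
variable_distribution_sketch <- function(dataframe, x_string, group_string = NULL) {
  # .data[[name]] looks a column up by its string name inside aes();
  # after_stat(density) replaces the older ..density.. notation.
  p <- ggplot(dataframe, aes(x = .data[[x_string]], y = after_stat(density)))
  if (is.null(group_string)) {
    p <- p +
      geom_histogram(bins = 15, alpha = 0.2, position = "identity") +
      geom_density(linewidth = 1)
  } else {
    p <- p +
      geom_histogram(bins = 15, alpha = 0.2, position = "identity",
                     aes(fill = .data[[group_string]])) +
      geom_density(linewidth = 1, aes(colour = .data[[group_string]])) +
      # Use the column name as the legend title
      labs(fill = group_string, colour = group_string)
  }
  # Use the column name as the x-axis label and return the plot object
  p + labs(x = x_string)
}

# Same toy data as in the question
my_data <- data.frame(y = rnorm(100, 0, 1), z = runif(100, 0, 1),
                      group1 = rep(c("A", "B"), each = 50),
                      group2 = as.factor(rep(1:4, each = 25)))

p1 <- variable_distribution_sketch(my_data, "y", "group1")
p2 <- variable_distribution_sketch(my_data, "z")
```

Unlike the function in the question, this sketch returns the plot rather than printing it, so a call at the console displays it and `print(p1)` can be used inside scripts or loops.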