
Consider the 2 data frames created below:

#data1:
set.seed(123)
data1 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
                   A = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                   B = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                   C = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5))
)
data1$D <- 100-(data1[,2]+data1[,3]+data1[,4])
data1$total <- sample(c(10:20), replace = T, length(data1[,1]))
#data2:
data2 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
          var1 = rnorm(20, mean = 1, sd = 1),
          var2 = rnorm(20, mean = 1, sd = 1),
          var3 = rnorm(20, mean = 1, sd = 1),
          var4 = rnorm(20, mean = 1, sd = 1)
          )

Assume that we took samples from 20 different locations, which are represented by the Loc column in each data set. data1 contains the clusters that observations were assigned to: clusters A, B, C, and D. The values in the A, B, C, and D columns give the percentage of observations from each Loc that were assigned to each cluster. For instance, there were 14 observations for Loc1; 25% of those were assigned to cluster B and 75% to cluster D. The total column gives the total number of observations taken from each Loc. data2 contains the average values of the variables that were used to create the clusters, all of which are on similar scales. Using the tidyverse framework, we can join the observations for each Loc and create a bar plot showing the percentage of observations from each Loc that were assigned to each cluster, as follows:

library(ggplot2)
library(dplyr)
library(tidyr)
data2 <- left_join(data2,data1,by= c("Loc"))
data2
plotdat <- data2 %>%
   pivot_longer(-c(Loc,total,var1:var4), names_to= "Cluster", values_to = "val") %>%
   mutate(val1 = val * total / 100)
myplot<-
plotdat %>%
  ggplot(., aes(x=Loc, y=val1, fill = Cluster))+
  geom_bar(stat = "identity")+
  geom_text(aes(y = total, label = ifelse(Cluster == "A", total, "")), nudge_y = 1, size = 3) +
  geom_text(aes(y = val1, 
                label = ifelse(val > 0, scales::percent(val, scale = 1, accuracy = 1), "")), 
            position = position_stack(vjust = .6), size = 2)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+
  labs(x="Sample Location", y="Sample Size")
myplot

This results in the following plot: enter image description here
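The bar heights in this plot are counts rather than percentages: the mutate() step rescales each cluster's percentage by the location's total. As a small check, here is that arithmetic for a single location, using the figures quoted above for Loc1 (14 observations, 25% in cluster B, 75% in cluster D). The `example` data frame is made up for illustration and is not drawn from data1.

```r
# Hypothetical single-location example using the figures quoted in the text
example <- data.frame(
  Loc = "Loc1",
  Cluster = c("A", "B", "C", "D"),
  val = c(0, 25, 0, 75),  # percentage of observations per cluster
  total = 14              # number of observations at this location
)

# Same conversion as in the mutate() call: percentage -> number of observations
example$val1 <- example$val * example$total / 100

example$val1
# 0.0 3.5 0.0 10.5

# The segments of one bar add up to that location's total
sum(example$val1)
# 14
```

Because the segments sum to total, the label placed at y = total lands just above the top of each stacked bar.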

I would like to know how we could use the data from the second data set, data2, to add a small line above each bar that shows the average value of the original variables (var1 to var4) that were used to produce the clusters (meaning that, for a given Loc, the average value of each var would be shown above that Loc's bar). I would like to connect the values that belong to the same variable with a line, with each variable having its own line color. What I am trying to do would look like this:

enter image description here

taken from this question: Plot line on top of stacked bar chart in ggplot2, except that I want to draw 4 differently colored lines (one for each var).

Although the variables are on a different scale from the "percents" we are plotting, we can just add 22 to each point:

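data2 <- data2 %>%
  select(Loc, var1:var4) %>%  # data2 was joined with data1 above; keep only the variable means
  pivot_longer(-Loc, names_to = "Var", values_to = "means")
data2$mu <- 22 + data2$means  # shift the means by 22 so they sit above the bars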

But how do we add them to the top of the bars in myplot, and connect a line for the observations with a unique color?

Ryan
  • So you want the values of `var1:4` to be located at the end of the bars with an errorbar? I think this doesn't make much sense because `var1:4` have different values, so it would be confusing to have them on the same level. On the other hand, the errorbars would be the same for all variables. Secondly, AFAIK `ggplot` doesn't allow different y scales, only transformations of scales. Maybe you have to split it into 2 plots. – starja Jul 18 '20 at 08:01
  • I think what you are trying to achieve would visually get quite crowded and confusing. I think in general visualisation should aid understanding the data, and not confuse? If you feel your outcome variable is more of a continuous nature, why not stick to only showing your means with standard errors? The other option would be to only show the bar graphs. But that's of course only my opinion. I am sure this is a dupe though. It's a very long question and I am not sure all this background is necessary for the main problem – tjebo Jul 18 '20 at 08:38
  • Does this answer your question? [How to stack error bars in a stacked bar plot using geom\_errorbar?](https://stackoverflow.com/questions/30872977/how-to-stack-error-bars-in-a-stacked-bar-plot-using-geom-errorbar) – tjebo Jul 18 '20 at 08:39
  • @starja please see my update, I have clarified what I am trying to accomplish – Ryan Jul 20 '20 at 00:32
  • @Ryan see my edit – starja Jul 20 '20 at 10:17

1 Answer


You could use facet_grid, make 2 plots and arrange them on top of each other:

set.seed(123)
data1 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
                    A = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                    B = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                    C = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5))
)
data1$D <- 100-(data1[,2]+data1[,3]+data1[,4])
data1$total <- sample(c(10:20), replace = T, length(data1[,1]))
#data2:
data2 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
                    val.var1 = rnorm(20, mean = 1, sd = 1),
                    val.var2 = rnorm(20, mean = 1, sd = 1),
                    val.var3 = rnorm(20, mean = 1, sd = 1),
                    val.var4 = rnorm(20, mean = 1, sd = 1),
                    se.var1 = rep(0.25, times = 20),
                    se.var2 = rep(0.25, times = 20),
                    se.var3 = rep(0.25, times = 20),
                    se.var4 = rep(0.25, times = 20))

library(ggplot2)
library(gridExtra)
library(dplyr)
library(tidyr)
plotdat <- data1 %>%
  pivot_longer(-c(Loc,total), names_to= "Cluster", values_to = "val") %>%
  mutate(val1 = val * total / 100)
plot1 <- plotdat %>%
  ggplot(., aes(x = Loc, y=val1, fill = Cluster))+
  facet_grid(cols = vars(Loc), scales = "free_x") + 
  geom_bar(stat = "identity")+
  geom_text(aes(y = total, label = ifelse(Cluster == "A", total, "")), nudge_y = 1, size = 3) +
  geom_text(aes(y = val1, 
                label = ifelse(val > 0, scales::percent(val, scale = 1, accuracy = 1), "")), 
            position = position_stack(vjust = .6), size = 2)+
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_blank())+
  labs(x="Sample Location", y="Sample Size")

plotdat2 <- data2 %>% 
  pivot_longer(-Loc, names_to = c(".value", "variable"),
               names_sep = "\\.") %>% 
  mutate(min = val - se,
         max = val + se)
plot2 <- plotdat2 %>% 
  ggplot(., aes(x = variable, y = val)) +
  facet_grid(cols = vars(Loc), scales = "free_x") +
  geom_point() +
  geom_errorbar(aes(ymin = min, ymax = max)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.title.x = element_blank())

grid.arrange(plot2, plot1, ncol = 1, nrow = 2)

enter image description here
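The reshaping step for plotdat2 may be the least obvious part of the code above, so here is a minimal sketch of what it does. It assumes a made-up two-location data frame called `wide` (hypothetical values) that follows the same val.* / se.* naming scheme as data2.

```r
library(tidyr)

# Hypothetical data with the same naming scheme as data2:
# val.* columns hold means, se.* columns hold standard errors.
wide <- data.frame(
  Loc = c("Loc1", "Loc2"),
  val.var1 = c(1.2, 0.7),
  val.var2 = c(0.9, 1.5),
  se.var1 = c(0.25, 0.25),
  se.var2 = c(0.25, 0.25)
)

# names_sep splits each column name at the dot, e.g. "val.var1" -> "val", "var1".
# ".value" makes the first piece the name of an output column (val, se),
# while the second piece is stored in the new "variable" column.
long <- pivot_longer(wide, -Loc,
                     names_to = c(".value", "variable"),
                     names_sep = "\\.")

long
```

The result has one row per Loc and variable, with val and se side by side, which is the shape geom_point and geom_errorbar need.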


Edit

With the following code, you can add a line plot to the bar plot. Because I use 2 different datasets, you need to specify the aes for every layer separately. Because the x-axis is categorical, you also need to specify the group aesthetic in geom_line. However, I strongly discourage the use of this graph, as the lines have a totally different unit from the bars.

set.seed(123)
data1 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
                    A = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                    B = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5)),
                    C = sample(c(0,15,20,25,40),size = 20,replace = T, prob = c(45,25,15,10,5))
)
data1$D <- 100-(data1[,2]+data1[,3]+data1[,4])
data1$total <- sample(c(10:20), replace = T, length(data1[,1]))
#data2:
data2 <- data.frame(Loc = paste("Loc", seq(1:20), sep = ""),
                    val.var1 = rnorm(20, mean = 1, sd = 1),
                    val.var2 = rnorm(20, mean = 1, sd = 1),
                    val.var3 = rnorm(20, mean = 1, sd = 1),
                    val.var4 = rnorm(20, mean = 1, sd = 1),
                    se.var1 = rep(0.25, times = 20),
                    se.var2 = rep(0.25, times = 20),
                    se.var3 = rep(0.25, times = 20),
                    se.var4 = rep(0.25, times = 20))

library(ggplot2)
library(dplyr)
library(tidyr)
plotdat <- data1 %>%
  pivot_longer(-c(Loc,total), names_to= "Cluster", values_to = "val") %>%
  mutate(val1 = val * total / 100)

plotdat2 <- data2 %>% 
  pivot_longer(-Loc, names_to = c(".value", "variable"),
               names_sep = "\\.") %>% 
  mutate(val = val + 22)


ggplot(plotdat)+
  geom_bar(aes(x = Loc, y=val1, fill = Cluster), stat = "identity")+
  geom_text(aes(x = Loc, y = total, label = ifelse(Cluster == "A", total, "")), nudge_y = 1, size = 3) +
  geom_text(aes(x = Loc, y = val1, 
                label = ifelse(val > 0, scales::percent(val, scale = 1, accuracy = 1), "")), 
            position = position_stack(vjust = .6), size = 2)+
  geom_line(data = plotdat2, mapping = aes(x = Loc, y = val, colour = variable,
                                           group = variable)) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_blank())+
  labs(x="Sample Location", y="Sample Size")

enter image description here
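If you do go with the combined graph, one way to make the unit mismatch less misleading might be a secondary axis that undoes the shift of 22, so the lines can be read in their own units. This is only a sketch: the `bars` and `lines` data frames and the axis name "Mean of variable" are made up for illustration, and the secondary axis is purely a relabelling of the primary one, not an independent scale.

```r
library(ggplot2)

# Hypothetical toy data, not the data from the question
bars <- data.frame(Loc = c("Loc1", "Loc2", "Loc3"), n = c(14, 18, 11))
lines <- data.frame(
  Loc = rep(c("Loc1", "Loc2", "Loc3"), times = 2),
  variable = rep(c("var1", "var2"), each = 3),
  val = c(0.8, 1.4, 1.1, 1.9, 0.3, 1.2)
)

p <- ggplot(bars) +
  geom_col(aes(x = Loc, y = n)) +
  # shift the lines by 22 so they sit above the bars;
  # group is needed because the x-axis is categorical
  geom_line(data = lines,
            aes(x = Loc, y = val + 22, colour = variable, group = variable)) +
  # the secondary axis subtracts the same 22, so its labels show the original values
  scale_y_continuous(
    name = "Sample Size",
    sec.axis = sec_axis(~ . - 22, name = "Mean of variable")
  )
p
```

The right-hand axis then reads in the units of the variables while the left-hand axis keeps the counts, although both still share one vertical range.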

starja