My question is similar to those posted here and here.

I am creating a graph in ggplot2 where I have one bar plot and want to overlay multiple line graphs on it. For the purposes of this question, I have reproduced my code for three bar plots: one that includes all years (2007-2015) and two for specific years (2007 and 2015). Ultimately I will be overlaying data from 10 different years. The data used can be found here.

library(dplyr)
library(tidyr)
library(gridExtra)
library(ggplot2)

overallpierc<-data[(data$item=="piercing"),]

overp<-overallpierc %>%
  group_by(age) %>% 
  count(sex) %>% 
  ungroup %>% 
  mutate(age = factor(age)) %>%
  complete(age, sex, fill = list(n = 0)) %>% 
  ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
    theme_classic() + 
    scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") + 
    labs(x = "Age", y = "Number of observations") +   
    theme(legend.position=c(0.4,0.8),
    plot.title = element_text(size = 10),
    legend.title=element_text(size=15),
    axis.title=element_text(size=15),
    legend.key.size = unit(1.13, "cm"),
    legend.direction="vertical",
    legend.text=element_text(size=15))

p07<-data[(data$yy=="2007") & (data$item=="piercing"),]
summary(p07)

subp07<-p07 %>%  
  group_by(age) %>% 
  count(sex) %>% 
  ungroup %>% 
  mutate(age = factor(age)) %>%
  complete(age, sex, fill = list(n = 0)) %>% 
  ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
    theme_classic() + 
    scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") + 
    labs(x = "Age", y = "Number of observations") +   
    theme(legend.position=c(0.4,0.8),
    plot.title = element_text(size = 10),
    legend.title=element_text(size=15),
    axis.title=element_text(size=15),
    legend.key.size = unit(1.13, "cm"),
    legend.direction="vertical",
    legend.text=element_text(size=15))

p15<-data[(data$yy=="2015") & (data$item=="piercing"),]

subp15<-p15 %>% 
  group_by(age) %>% 
  count(sex) %>% 
  ungroup %>% 
  mutate(age = factor(age)) %>%
  complete(age, sex, fill = list(n = 0)) %>% 
  ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
    theme_classic() + 
    scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") + 
    labs(x = "Age", y = "Number of observations") +   
    theme(legend.position=c(0.4,0.8),
    plot.title = element_text(size = 10),
    legend.title=element_text(size=15),
    axis.title=element_text(size=15),
    legend.key.size = unit(1.13, "cm"),
    legend.direction="vertical",
    legend.text=element_text(size=15))

grid.arrange(overp, subp07, subp15)

The code I have posted gives me a figure with the three bar plots stacked in a single column.

What I am trying to do is plot the frequencies for females in 2007 and 2015 and males in 2007 and 2015 on top of the barplot for total frequencies (where this is also reflected in the legend). Is there a way to do that in R using ggplot2?

UPDATE: I tried using the geom_smooth and geom_line functions to add the lines to my ggplot, as suggested in the comments and in answers to other users' questions, but I get the following error:

Error: Discrete value supplied to continuous scale

I created a new data frame for a subset that I would like to plot:

df<-data.frame(age=c(15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,40,50,60), val=c(0,5,13,77,70,106,62,51,46,27,46,16,22,16,14,48,21, 3,4))

And then added it to the ggplot code:

overallpierc %>%
  filter(age != "15") %>% 
  group_by(age) %>% 
  count(sex) %>% 
  ungroup %>% 
  mutate(age = factor(age)) %>%
  complete(age, sex, fill = list(n = 0)) %>% 
  ggplot(aes(age, n)) +     
    geom_line(data=df,aes(x=as.numeric(age),y=val),colour="blue") +
    geom_col(aes(fill = sex), position = "dodge") +
    theme_classic() + 
    scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") + 
    labs(x = "Age", y = "Number of observations") +   
    theme(legend.position=c(0.4,0.8),
    plot.title = element_text(size = 10),
    legend.title=element_text(size=15),
    axis.title=element_text(size=15),
    legend.key.size = unit(1.13, "cm"),
    legend.direction="vertical",
    legend.text=element_text(size=15))

Others have encountered similar issues and used as.numeric to solve the problem. However, age needs to be treated as a factor for the purposes of plotting.

Blundering Ecologist
  • Could you simply add geom_smooth to your ggplot based on a dataframe with the value for each age being the number of observations? – Will Aug 27 '17 at 20:14
  • True, but I was hoping to learn how to code a more sophisticated solution instead of having to create a separate dataframe each time since I keep encountering this problem in my dissertation. – Blundering Ecologist Aug 27 '17 at 20:19
  • I had a similar problem for my dissertation and I defined a function that operated on a dataframe and produced the required resultant dataframe. In calling ggplot components I set the data argument to be the function of my dataframe, i.e. `+ geom_smooth(data=aggregatingFunction(df),aes ...)` – Will Aug 27 '17 at 20:39
  • Would you consider using stacked bars? That would give you frequencies by sex and total frequency in the same bar. Then you simply facet by year. I'll post an example if that sounds useful. – neilfws Aug 28 '17 at 05:08
  • @neilfws The stacked bars might be okay for adding 1-2 years, but I think it will be much too hard visually to tell the difference when I need to plot 10 years (all with relatively similar outputs/frequencies). – Blundering Ecologist Aug 28 '17 at 13:28
  • @WilliamAshford I tried using the code you suggested, but I am getting error messages. I updated my question above to reflect the problems I am encountering. If you have any suggestions, please let me know. – Blundering Ecologist Sep 02 '17 at 18:03
  • Could you expand on why you need to use factor for plotting? @BlunderingEcologist – Will Sep 04 '17 at 14:34
  • @WilliamAshford I need to use `factor` for plotting because if I keep `age` as `numeric`, it creates this huge space between 30, 40, 50, and 60 (see the plots below (in neilfws's answer) to see what I mean). – Blundering Ecologist Sep 04 '17 at 15:37
  • @BlunderingEcologist looking at https://stackoverflow.com/questions/16350720/using-geom-line-with-x-axis-being-factors might I suggest: `geom_point(data=fun(data),aes(x=age, y=nObs, group=1),stat='summary', fun.y=sum) + stat_summary(fun.y=sum, geom="line")` – Will Sep 05 '17 at 11:00
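For reference, here is a minimal sketch of the approach Will describes: a helper function builds the summary data frame and is passed straight to the `data` argument of the line layer. The error in the question most likely comes from mixing `as.numeric(age)` (continuous) in one layer with a factor `age` (discrete) in another, so this sketch keeps `age` as a factor in both layers and uses `group = 1` to join the points. The `toy` data, the helper name `count_by_age`, and the column name `n_obs` are all made up for illustration and are not from the original dataset.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the question's data (made-up values, not the real dataset)
toy <- data.frame(
  age = c(16, 16, 17, 18, 18, 18, 30, 40, 40, 60),
  sex = c("F", "M", "F", "F", "M", "M", "F", "M", "F", "M")
)

# Hypothetical helper: total observations per age, with age returned as a
# factor so it matches the discrete x axis used by the bars
count_by_age <- function(d, age_levels = sort(unique(d$age))) {
  d %>%
    count(age, name = "n_obs") %>%
    mutate(age = factor(age, levels = age_levels))
}

totals <- count_by_age(toy)

p <- toy %>%
  count(age, sex) %>%
  mutate(age = factor(age)) %>%
  ggplot(aes(age, n)) +
  geom_col(aes(fill = sex), position = "dodge") +
  # x is a factor in both layers; group = 1 joins the points into one line
  geom_line(data = totals, aes(age, n_obs, group = 1), colour = "blue")
```

Because the helper takes any data frame, the same call can be reused on a per-year subset instead of building a separate data frame by hand each time.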

1 Answer

Based on our discussion in the comments, let's try stacked bars and facets. I think it works but you can decide for yourself.

The stacked bar has the advantage of showing both proportions and total count in the same bar. To compare years, a facet grid places years in rows, so the eye can scan downwards to compare the same age in different years. Note that I kept age as a continuous variable here, rather than a factor.

library(dplyr)
library(ggplot2)
data30g %>% 
  count(yy, sex, age) %>% 
  ggplot(aes(age, n)) + 
    geom_col(aes(fill = sex)) + 
    facet_grid(yy ~ .) + 
    theme_bw() + 
    scale_fill_manual(values = c("#000000", "#cccccc"))

Not bad: I can see straight away, for example, an increase in both total and female count at age 30 over time. The panels are perhaps a little small and crowded, though.

We can use a facet wrap instead of a grid to make the bars clearer, but at the expense of quick visual comparison across years.

data30g %>% 
  count(yy, sex, age) %>% 
  ggplot(aes(age, n)) + 
    geom_col(aes(fill = sex)) + 
    facet_wrap(~yy, ncol = 2) + 
    theme_bw() + 
    scale_fill_manual(values = c("#000000", "#cccccc"))

One more example which does not address your question in terms of total counts or barplots - but I thought it might be of interest. This code generates a "heatmap" style of plot which is poor for quantitative comparison, but can sometimes give a quick visual impression of interesting features. I think it shows, for example, that females aged 20 in 2014 have the highest total count.

data30g %>% 
  count(yy, sex, age) %>% 
  ggplot(aes(factor(age), yy)) + 
    geom_tile(aes(fill = n)) + 
    facet_grid(sex ~ .) + 
    scale_fill_gradient2() + 
    scale_y_reverse(breaks = 2006:2015) + 
    labs(x = "age", y = "Year")

EDIT:

Based on further discussions in the comments, here is one way to plot age as a factor, using bars for sexes, overlaid with a line for the totals and split by year.

overallpierc %>% 
  count(yy, sex, age) %>% 
  ggplot() + 
    geom_col(aes(factor(age), n, fill = sex), position = "dodge") +
    # in ggplot2 >= 3.3.0 the argument is `fun`; older versions call it `fun.y`
    stat_summary(aes(factor(age), n), fun = "sum", geom = "line", group = 1) + 
    facet_grid(yy ~ .)

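If a single panel is preferred over facets, the same idea can be pushed a bit further. The following is only a sketch on made-up data (the `toy` data frame and the names `bars` and `year_lines` are assumptions, not part of the original dataset): bars show the counts per age and sex over all years, and one line per year and sex is drawn on top. Age is a factor with the same levels in both data frames, so the x scale stays discrete and the ages remain evenly spaced.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in (made-up values); yy is the year, as in the question
set.seed(1)
toy <- data.frame(
  yy  = sample(c(2007, 2015), 200, replace = TRUE),
  sex = sample(c("F", "M"), 200, replace = TRUE),
  age = sample(c(16:20, 30, 40, 50, 60), 200, replace = TRUE)
)

# Shared factor levels so both layers use the same discrete x axis
age_levels <- sort(unique(toy$age))

# Bars: counts per age and sex over all years
bars <- toy %>%
  count(age, sex) %>%
  mutate(age = factor(age, levels = age_levels))

# Lines: counts per age and sex within each year
year_lines <- toy %>%
  count(yy, sex, age) %>%
  mutate(age = factor(age, levels = age_levels), yy = factor(yy))

p <- ggplot(bars, aes(age, n)) +
  geom_col(aes(fill = sex), position = "dodge") +
  # one line per year/sex combination; group tells ggplot which points to join
  geom_line(data = year_lines,
            aes(colour = yy, linetype = sex, group = interaction(yy, sex))) +
  scale_fill_manual(values = c("#000000", "#CCCCCC"), name = "Sex") +
  labs(x = "Age", y = "Number of observations",
       colour = "Year", linetype = "Sex (lines)") +
  theme_classic()
```

Ages with no observations in a given year are simply skipped by the line; `tidyr::complete()` could be used to fill them with zeros, as in the question's own code. With ten years and two sexes this gives twenty lines, so it may well end up as crowded as the facets.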

neilfws
  • This wasn't exactly what I was trying to achieve, but it might have to suffice if there is no way to overlay multiple line graphs onto a bar plot using `ggplot2`. Thank you for your help! – Blundering Ecologist Sep 01 '17 at 03:38
  • When I coerce the `age` variable to be a factor instead of numeric, I encounter problems when trying to add additional lines using the `geom_smooth` and `geom_line` functions. Might you know a work around for this? (I updated my question above.) – Blundering Ecologist Sep 02 '17 at 18:04
  • I think the issue is that you changed `age` to numeric in the `geom_line()` part of your code, but not in the `ggplot()` part. – neilfws Sep 03 '17 at 22:30
  • Yes, I imagine that must be a problem. But, I am trying to keep `age` as a factor in the `ggplot()` part since I don't want to have large spaces between 30, 40, 50, and 60. But, it appears that `age` needs to be `numeric` in the `geom_line()` part. Is there a way to work around that? i.e. to not have those big spaces between 30, 40, 50, and 60 that show up when `age` is `numeric`? – Blundering Ecologist Sep 04 '17 at 15:40
  • In your new code example, the sum of M + F is given by the total height of the stacked bars. I'm not clear why a line is necessary. However: I have edited my original answer with some code which uses `stat_summary` to overlay the line on the bars, as per your original question. Personally, I don't think lines on top of bars is a good visualisation technique. – neilfws Sep 04 '17 at 23:06
  • The idea I had was to use 1 figure/graph with the bar plot for the overall data and the lines for each separate year (i.e. 10 lines, 1 for 2006, 2007, 2008, etc). This way I don't have 10 different graphs, but 1 graph that shows the trends. Might you have any suggestions about how to do this in a way that is visually appealing? – Blundering Ecologist Sep 05 '17 at 15:04