ggplot2 histogram: how do I add textual annotation onto histogram bars using ggplot2

Question

I am working with a data frame that has the following header names:

> [1] "Filename" "Strain" "DNA_Source" "Locus_Tag" "Product" "Transl_Tbl" "Note" "Seq_AA" "Protein_ID"

Using the following code I get a graph that shows how many genes are found within a particular bacterial strain:

png(filename=paste('images/Pangenome_Histogram.png', sep=''), width=3750,height=2750,res=300)
par(mar=c(9.5,4.3,4,2))
print(h <- ggplot(myDF, aes(x=Strain, stat='bin', fill=factor(Filename), label=myDF$Filename)) + geom_bar() +
      labs(title='Gene Count by Strain Pangenome', x='Campylobacter Strains', y='Gene Count\n') +
      guides(title.theme = element_text(size=15, angle = 90)) + theme(legend.text=element_text(size=15), text = element_text(size=18)) +
      theme(axis.text.x=element_text(angle=45, size=16, hjust=1), axis.text.y=element_text(size=16), legend.position='none', plot.title = element_text(size=22)) )

It is perhaps a bit hard to see, but some strains have multi-colored bars, indicating that some of the strain's genes come from sources other than the bacterial chromosome (or from several chromosomes if the bacterium has more than one). I would like to label the bars according to the source of the genes ("DNA_Source") at the appropriate position. This is my attempt:

png(filename=paste('images/Pangenome_Histogram.png', sep=''), width=3750,height=2750,res=300)
par(mar=c(9.5,4.3,4,2))
print(h <- ggplot(myDF, aes(x=Strain, stat='bin', fill=factor(Filename), label=myDF$Filename)) + geom_bar() +
      labs(title='Gene Count by Strain Pangenome', x='Campylobacter Strains', y='Gene Count\n') +
      guides(title.theme = element_text(size=15, angle = 90)) + theme(legend.text=element_text(size=15), text = element_text(size=18)) +
  geom_text(aes(label=DNA_Source, y='identity'), color='black', vjust=-5, size=4) +
      theme(axis.text.x=element_text(angle=45, size=16, hjust=1), axis.text.y=element_text(size=16), legend.position='none', plot.title = element_text(size=22)) )

This gets me close, but it removes the counts from the y-axis (and adds the word "identity" at the lower left-hand side), and it draws the labels on top of each other so that they cannot be read unless they are the same word.

I would like to have the y-axis labeled as in the first image, with the labels from the second, but with those labels appearing within their corresponding colored portion of the histogram (visually similar to: Showing data values on stacked bar chart in ggplot2), and I would like to accomplish it using the ggplot2 package.

I hope this is clear. Help is appreciated -- so thanks in advance.

Here is a bit of data (tail(dput(myDF[c(2, 3, 5)])))...

          Strain DNA_Source                             Product
12299 Campy3194c    Plasmid Type VI secretion protein, VC_A0111
12300 Campy3194c    Plasmid           Type VI secretion protein
12301 Campy3194c    Plasmid                              Tgh104
12302 Campy3194c    Plasmid                        protein ImpC
12303 Campy3194c    Plasmid           Type VI secretion protein
12304 Campy3194c Chromosome                              Tgh079

Can you either post a `dput(df)` of your data or post an MWE so that we have some data to play with? :) Also, a small tip: use `ggsave` instead of `png()`, it's much easier — David, Nov 13 '15 at 16:00
Also, can you elaborate on how your question is different from the one you refer to, where Ramnath gives a perfectly clear answer? — David, Nov 13 '15 at 16:08
I could not figure out how to incorporate the labels using stat='identity' or y='identity' without getting an error (geom_text requires the following missing aesthetics: y) or removing my y-axis information and adding 'identity' to the axis. I will deal with the position of the labels once I have that nailed down first. — cer, Nov 13 '15 at 16:24
Also, I am not interested in labeling the histogram with frequency, but rather the source of the frequency. — cer, Nov 13 '15 at 16:31
The data that you showed, is that what you called `myDF`? Can you post some data that we can use without too much hassle, i.e. `dput(myDF)`? — David, Nov 13 '15 at 16:34
Please put the code the other way around: `dput(tail(myDF[, c(2,3,5)]))`, and maybe also `summary(myDF[, c(2,3,5)])`. — David, Nov 13 '15 at 16:55
David -- I appreciate your willingness to help, but dput(tail(myDF[, c(2,3,5)])) is almost 700 lines. You have the structure of my dataframe and some sample data. I am hoping you or someone else can address why my y-axis disappears when I try to label the histogram, without my having to provide large swaths of data. — cer, Nov 13 '15 at 17:05

score 2 · Accepted Answer · answered Nov 13 '15 at 17:53

Say that you have a dataset that looks like this:

library(data.table)
library(ggplot2)
set.seed(123)
dna_src <- c("Chromosome", "Plasmid")
myDF <- data.table(Strain = c(rep("Campy3149c", 100),
                              rep("Campy31147q", 100)),
                   DNA_Source = c(sample(dna_src, size = 100, replace = TRUE,
                                    prob = c(0.9, 0.1)),
                                  sample(dna_src, size = 100, replace = TRUE,
                                    prob = c(0.7, 0.3))))
head(myDF)
#       Strain DNA_Source
#1: Campy3149c Chromosome
#2: Campy3149c Chromosome
#3: Campy3149c Chromosome
#4: Campy3149c Chromosome
#5: Campy3149c    Plasmid
#6: Campy3149c Chromosome

You can use data.table to collapse the data into a shorter data.table with one row per Strain and DNA_Source combination, which has most of the information we need. The only thing missing is the y-value for each label, which we calculate as follows:

dt <- myDF[, .(countStrain = .N), by = c("Strain", "DNA_Source")][order(Strain, DNA_Source)]

# add the y-values for the plot
dt[, yval := cumsum(countStrain) - 0.5 * countStrain, by = Strain]
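
To see what that `yval` formula does, here is a minimal sketch with made-up counts for a single strain (the numbers are an assumption, not taken from the real data). `cumsum()` gives the top edge of each stacked segment, and subtracting half the segment height moves the label to the middle of that segment:

```r
# Hypothetical counts for one strain, listed in stacking order from the bottom up
counts <- c(Chromosome = 90, Plasmid = 10)

# Top edge of each stacked segment
tops <- cumsum(counts)

# Midpoint of each segment = top edge minus half the segment height
mids <- tops - 0.5 * counts
print(mids)
```

The labels only line up if the rows are ordered the same way the bars are stacked, which is why the rows are sorted before `cumsum()` is applied per strain.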

Lastly, we plot the values:

ggplot(dt, aes(x = Strain, y = countStrain, fill = DNA_Source)) + 
  geom_bar(stat = "identity") + 
  geom_text(data = dt, aes(x = Strain, y = yval, label = DNA_Source))

This results in a stacked bar chart with a DNA_Source label inside each segment (plot image not shown).
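
As a hedged alternative, newer ggplot2 versions (2.2.0 and later, as far as I know) can compute the label position themselves via `position_stack(vjust = 0.5)`, so no `yval` column is needed. That matters because those versions stack segments in the reverse of the group order, so hand-computed cumulative positions may land in the wrong segment. It also avoids the `y = 'identity'` mapping from the question, which maps y to a constant string and turns the axis discrete. The data frame below is a hypothetical stand-in shaped like `dt`:

```r
library(ggplot2)

# Hypothetical pre-counted data, shaped like `dt` above (values are made up)
counts <- data.frame(
  Strain      = c("Campy3149c", "Campy3149c", "Campy31147q", "Campy31147q"),
  DNA_Source  = c("Chromosome", "Plasmid", "Chromosome", "Plasmid"),
  countStrain = c(90, 10, 70, 30)
)

p <- ggplot(counts, aes(x = Strain, y = countStrain, fill = DNA_Source)) +
  geom_col() +                                   # same as geom_bar(stat = "identity")
  geom_text(aes(label = DNA_Source),
            position = position_stack(vjust = 0.5))  # center each label in its segment

# Inspect the computed label positions for the text layer (layer 2)
label_pos <- layer_data(p, 2)
```

`layer_data()` is only used here to check where the labels end up; printing `p` draws the plot.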

Thank you @David. I have not been able to get around an unused argument error when I apply your solution to my data, but this looks like what I want. — cer, Nov 13 '15 at 20:12
Without knowing more about your data, I am not able to guide you to the right solution. However, if my answer has helped you, please consider marking it as the answer and giving a point. — David, Nov 13 '15 at 22:29
