1

I have some issues trying to plot multiple boxplots in single figure using ggplot, even though I can easily do that in vanilla R. I tried to google for answers for quite a few hours, but I couldn't find anyone who had same problem as me.

So basically, I have a data frame with 25 columns

[1] "gene"                    "NumberIntrons"          
[3] "NumResidues"             "protein_coding"         
[5] "ncRNA"                   "Rel_telomere"           
[7] "mRNA_copies_per_cell"    "protein_copies_per_cell"
[9] "mRNA.stabilities"        "GeneticDiversity"       
[11] "ProteinHalfLife"         "Golgi"                  
[13] "Mitochondrion"           "Nuclear_dots"           
[15] "Nuclear_envelope"        "Nucleolus"              
[17] "Nucleus"                 "Vacuole"                
[19] "start"                   "end"                    
[21] "solid.media.KO.fitness"  "gene.expression.RPKM"   
[23] "conservation.phyloP"     "chromosome"             
[25] "essential"  

Values in "Golgi", "Nucleolus", "Nuclear_envelope", "Mitochondrion", "Nuclear_dots", "Nucleus" and "Vacuole" are either 1 or 0, depending if protein expressed from the gene is found in that compartment or not.
So now my problem is that I want to compare the expression rates between proteins in different compartments. And I tackled that problem by creating different data frame for each compartment using subset() function in order to place all genes that are expressed in particular compartment in their specific dataframe. And in order to create a graph using basic R, I can simply run following command:

boxplot(log10(golgi.df$gene.expression.RPKM),
        log10(mito.df$gene.expression.RPKM),
        log10(nucleus.df$gene.expression.RPKM),
        log10(nuc_env.df$gene.expression.RPKM),
        ...

which lets me compare gene expression rate between proteins found in every compartment. Problem is, I simply have no idea how can I run the same command in ggplot, I'm not sure if I'm simply missing something super obvious, because I can't figure it out myself and I just couldn't find anyone who had similar problem as me.

Many thanks for any sort of help

Edit: data sample:

          gene NumberIntrons NumResidues protein_coding ncRNA Rel_telomere
1  SPAC1002.01             1         179              1     0    0.4768186
2  SPAC1002.02             0         229              1     0    0.4770079
3 SPAC1002.03c             0         923              1     0    0.4772343
4 SPAC1002.04c             0         199              1     0    0.4782177
5 SPAC1002.05c             0         715              1     0    0.4784627
  mRNA_copies_per_cell protein_copies_per_cell mRNA.stabilities
1                 0.46                      NA         38.30908
2                 3.80                 2564.51         32.34325
3                 7.20                 9719.87         47.75722
4                 1.40                 1600.25         26.73106
5                 0.52                      NA         20.38781
  GeneticDiversity ProteinHalfLife Golgi Mitochondrion Nuclear_dots
1        0.0024331              NA     0             1            0
2        0.0025389           177.6     0             0            0
3        0.0041078           959.8     0             0            0
4        0.0043796           756.3     0             0            0
5        0.0043057              NA     0             0            0
  Nuclear_envelope Nucleolus Nucleus Vacuole   start     end
1                0         0       0       0 1798347 1799015
2                0         0       0       0 1799061 1800053
3                0         0       0       0 1799915 1803141
4                0         0       0       0 1803624 1804491
5                0         0       1       0 1804548 1806797
  solid.media.KO.fitness gene.expression.RPKM conservation.phyloP
1               1.042607              6.94700           0.0000000
2               1.095876             88.64500           0.3780292
3               1.035408            103.73833           0.5133217
4                     NA             33.71667           0.5185620
5               1.056457              8.13000           0.4469873
  chromosome essential
1          I        NA
2          I        NA
3          I        NA
4          I         1
5          I        NA
Phil
  • 7,287
  • 3
  • 36
  • 66
Kesat Ola
  • 13
  • 3
  • can you create and share any sample data? – Death Metal Aug 25 '20 at 14:17
  • @DeathMetal I edited in some sample data into the main post, I'm not sure if that's the way I should have presented it, or is there better way to add it to the post? – Kesat Ola Aug 25 '20 at 14:22
  • 1
    Hello and welcome on SO :) Please see the following link to find guidelines on [how to make great R reproducible examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example, to produce a minimal data set, you can use `head()`, `subset()`. Then use `dput()` to give us something that can be put in R immediately. Alternatively, you can use base R datasets such as `mtcars`, `iris`, *etc*. – Paul Aug 25 '20 at 14:33
  • 1
    It is unclear what boxplots you looking for... so say in plot, it should have 7 bars representing 7 compartments, and each plot represents one gene? e.g. 10 genes = 10 plots times 7 bars? or are you mixing all the genes up and purely looking at the distribution of `gene.expression.RPKM` for all genes in each compartment? – Ben Toh Aug 25 '20 at 14:34
  • 1
    Using functions from `tidyr` and `dplyr` packages you could try to get a table with at least this info: `gene`, `gene.expression.RPKM` and another one with names of compartments. Then it will be very easy to make ggplots. Base `plot` and `ggplot` do not use the same type of table to make graphs. It wont be necessary to split your data btw. – Paul Aug 25 '20 at 14:39
  • @Paul thank you very much for all your help, I tried to use `dput()` command as you suggested, but I think my data is too big, because it still generated 100k characters, even though I used `head()`. I ran `dput()` command on first 100 rows and put it here (https://pastebin.com/2L0kAcpx) if you are still willing to help, its almost the same size as just `dput(head())` anyway, so thought why not put 100 rows instead. – Kesat Ola Aug 25 '20 at 15:12
  • @BenToh Sorry for not being clear, I'm trying to mix all the genes to their compartments, and just purely look at gene.expression.RPKM for all genes in each compartment. – Kesat Ola Aug 25 '20 at 15:13
  • @Paul what commands would you recommend using with `tidyr` `dplyr` ? I downloaded the packages but I'm completely in the dark. – Kesat Ola Aug 25 '20 at 15:17
  • @Kesat Ola, they are many very useful function in these packages. I often use `select()`, `filter()`, `group_by()`, `mutate()` and `pivot_longer()`. I found [this document](https://r4ds.had.co.nz/) very useful. – Paul Aug 26 '20 at 06:22

2 Answers2

2

@Em Laskey has the gist of the answer to your question. Here's just a little longer walkthrough.

Here I create a fake dataset without losing the generality.

set.seed(1234)
df <- data.frame(gene = rep(c("A", "B"), each = 5),
                 Mitochondrion = rbinom(10, 1, 0.5),
                 Golgi = rbinom(10, 1, 0.5),
                 expression = runif(10, 0, 100))
df
#   gene Mitochondrion Golgi expression
#1     A             0     1  31.661245
#2     A             1     1  30.269337
#3     A             1     0  15.904600
#4     A             1     1   3.999592
#5     A             1     0  21.879954
#6     B             1     1  81.059855
#7     B             0     0  52.569755
#8     B             0     0  91.465817
#9     B             1     0  83.134505
#10    B             1     0   4.577026

tidyr is used to convert your data table between "wide" and "long" format. It is more straightforward to show you than to explain what they are. The current table is considered "wide" and we use pivot_longer() function to make the wide table long:

df_long <- df %>%
  pivot_longer(c(Mitochondrion, Golgi),
               names_to = "Compartment",
               values_to = "Presence")
df_long
## A tibble: 20 x 4
#   gene  expression Compartment   Presence
#   <chr>      <dbl> <chr>            <int>
# 1 A          31.7  Mitochondrion        0
# 2 A          31.7  Golgi                1
# 3 A          30.3  Mitochondrion        1
# 4 A          30.3  Golgi                1
# 5 A          15.9  Mitochondrion        1
# 6 A          15.9  Golgi                0
# 7 A           4.00 Mitochondrion        1
# 8 A           4.00 Golgi                1
# 9 A          21.9  Mitochondrion        1
#10 A          21.9  Golgi                0
#11 B          81.1  Mitochondrion        1
#12 B          81.1  Golgi                1
#13 B          52.6  Mitochondrion        0
#14 B          52.6  Golgi                0
#15 B          91.5  Mitochondrion        0
#16 B          91.5  Golgi                0
#17 B          83.1  Mitochondrion        1
#18 B          83.1  Golgi                0
#19 B           4.58 Mitochondrion        1
#20 B           4.58 Golgi                0

So in the original wide format, each row contains information of a gene's presence in Mitochondrion and Golgi. But in long format, each row contains only a gene's presence in one compartment. As a result, the dataframe doubled its size (2 compartments here, in your case 7 compartments would be 7-times original size).

From your comment, you want to mix all genes up. So we don't really need to care about manipulation at the gene columns. And since we care only about where the gene is present, we can remove all observations with Presence == 0. We use dplyr package filter() function to achieve that:

df_long_pres <- df_long %>%
  filter(Presence == 1)
df_long_pres
## A tibble: 11 x 4
#   gene  expression Compartment   Presence
#   <chr>      <dbl> <chr>            <int>
# 1 A          31.7  Golgi                1
# 2 A          30.3  Mitochondrion        1
# 3 A          30.3  Golgi                1
# 4 A          15.9  Mitochondrion        1
# 5 A           4.00 Mitochondrion        1
# 6 A           4.00 Golgi                1
# 7 A          21.9  Mitochondrion        1
# 8 B          81.1  Mitochondrion        1
# 9 B          81.1  Golgi                1
#10 B          83.1  Mitochondrion        1
#11 B           4.58 Mitochondrion        1

Ok, so we can make the boxplots now. ggplot() is a little different from base R plots, you need to be aware that ggplot() uses dataframe. The original dataframe wouldn't work because at any one layer, ggplot() takes only one column as elements for x-axis, and one for y-axis. You want your Mitochondrion, Golgi... in x-axis but those information span across 7 columns. That's why we need to consolidate them into one column.

ggplot(df_long_pres) +
  geom_boxplot(aes(x = Compartment, y = log10(expression)))

Output plot

Ben Toh
  • 742
  • 5
  • 9
  • 1
    Appreciate very detailed response, I think I understood everything that was happening in console. Definitely very helpful for someone who is very green. – Kesat Ola Aug 25 '20 at 16:33
0

If you use ggplot and bring your data into a tidy form (i.e. long format), you can use easily use ggplot to plot several boxplots at once, as suggested in the comments.

I'm not sure if I completely understand what you're after, but if you want one boxplot per compartiment, combining the gene expression of all genes present in said compartiment, this is a possible solution (I didn't use all your data as it is quite an effort to get it into R in the way you present it):

# load packages
library(dplyr); library(tidyverse)

# transform your data into long format
dat_long <- dat %>%
  select(c(1, 12:14, 16)) %>% # select columns indicating whether a protein is present + the column indicating gene expression
  pivot_longer(-c(1, 5), names_to = "compartiment", values_to = "value") # transform into long format

# make gene expression numeric
dat_long$gene.expression.RPKM <- as.numeric(dat_long$gene.expression.RPKM)

# plot
ggplot(dat_long[dat_long$value != 0, ], aes(x=compartiment, y = gene.expression.RPKM)) + # plot gene expression, but only if the protein is present in a certain compartiment (i.e. value != 0)
  geom_boxplot() 

This results in a single boxplot, as only one gene is present in only one compartiment in the data you provide. If I manually set some other genes to 1, this results in one boxplot per compartiment, combining all genes experssed there:

# manually change presence of genes as example
dat_long[8, 4] <- 1
dat_long[3, 4] <- 1
dat_long[12, 4] <- 1
dat_long[5, 4] <- 1
dat_long[15, 4] <- 1

ggplot(dat_long[dat_long$value != 0, ], aes(x=compartiment, y = gene.expression.RPKM)) + # plot gene expression, but only if the protein is present in a certain compartiment (i.e. value != 0)
  geom_boxplot() 

If you want a boxplot for every gene on their own, you can use the gene-name as x-value and split compartiments with the command "facet_wrap(~ compartiment), but as this results in a single value per gene and compartiment I assume this is not what you want.

I hope this helps! Let me know if it is not what you wanted.

Data I used (copied parts of the data you provide):

# data I used:
> dput(dat)
structure(list(gene = structure(1:5, .Label = c("SPAC1002.01", 
"SPAC1002.02", "SPAC1002.03c", "SPAC1002.04c", "SPAC1002.05c"
), class = "factor"), NumberIntrons = c(1L, 0L, 0L, 0L, 0L), 
    NumResidues = c(179L, 229L, 923L, 199L, 715L), protein_coding = c(1L, 
    1L, 1L, 1L, 1L), ncRNA = c(0L, 0L, 0L, 0L, 0L), Rel_telomere = structure(1:5, .Label = c("0.4768186", 
    "0.4770079", "0.4772343", "0.4782177", "0.4784627"), class = "factor"), 
    mRNA_copies_per_cell = structure(c(1L, 4L, 5L, 3L, 2L), .Label = c("0.46", 
    "0.52", "1.4", "3.8", "7.2"), class = "factor"), protein_copies_per_cell = structure(c(NA, 
    2L, 3L, 1L, NA), .Label = c("1600.25", "2564.51", "9719.87"
    ), class = "factor"), mRNA.stabilities = structure(c(4L, 
    3L, 5L, 2L, 1L), .Label = c("20.38781", "26.73106", "32.34325", 
    "38.30908", "47.75722"), class = "factor"), GeneticDiversity = structure(c(1L, 
    2L, 3L, 5L, 4L), .Label = c("0.0024331", "0.0025389", "0.0041078", 
    "0.0043057", "0.0043796"), class = "factor"), ProteinHalfLife = structure(c(NA, 
    1L, 3L, 2L, NA), .Label = c("177.6", "756.3", "959.8"), class = "factor"), 
    Golgi = c(0L, 0L, 0L, 0L, 0L), Mitochondrion = c(1L, 0L, 
    0L, 0L, 0L), Nuclear_dots = c(0L, 0L, 0L, 0L, 0L), solid.media.KO.fitness = structure(c(2L, 
    4L, 1L, NA, 3L), .Label = c("1.035408", "1.042607", "1.056457", 
    "1.095876"), class = "factor"), gene.expression.RPKM = structure(c(3L, 
    5L, 1L, 2L, 4L), .Label = c("103.73833", "33.71667", "6.947", 
    "8.13", "88.645"), class = "factor"), conservation.phyloP = structure(c(1L, 
    2L, 4L, 5L, 3L), .Label = c("0", "0.3780292", "0.4469873", 
    "0.5133217", "0.518562"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Em Laskey
  • 508
  • 4
  • 15
  • 1
    Your code worked perfectly. I don't really fully understand how it works, but now I have clear direction on what I should read about to learn more. Many thanks. – Kesat Ola Aug 25 '20 at 16:26