2

I have a large matrix of 4775 rows of genes and 37 columns (individual samples). The data in each cell is the logCPM value for each gene.

Is it possible to create a boxplot of just a single row (gene) that I am interested in? I have a corresponding metadata dataframe which has groupings for the individual samples.

I have been trying something like this which is not working:

boxplot(logcpm~pd_2$PEDIS, data=logcpm["K11040",], ylab="LogCPM", xlab="PEDIS Group") 

If I do this code:

boxplot(logcpm["K11040",], ylab="LogCPM", xlab="PEDIS Group") 

I get something like this:

enter image description here

However this obviously doesn't seperate the values based on patient group. How do I plot the logcpm values from a matrix while also including the groupings from a separate metadata dataframe?

Example data of what I mean:

I have a matrix similar to the below structure:

1 2 3 4 5
gene1 0.3 0.4 0.8 1.3 3.3
gene2 0.0 0.1 0.8 0.9 0.9
gene3 0.4 0.6 0.9 0.1 0.1

And metadata dataframe:

Sample PEDIS
1 PEDIS1
2 PEDIS1
3 PEDIS2
4 PEDIS2
5 PEDIS2

What I want to do is create a boxplot of only gene2 but group the values based on the pedis column in my metadata dataframe

mrad
  • 173
  • 1
  • 11
  • Please make a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) or [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – Martin Gal Sep 08 '21 at 23:16
  • 3
    `boxplot(logcpm["K11040",] ~ pd_2$PEDIS)` ? – thelatemail Sep 08 '21 at 23:28
  • 2
    @thelatemail's answer looks correct to me, however I recommend https://bioinformatics.stackexchange.com/ for this type of domain-specific question. The users there may be more helpful with the 'big picture' aspects of your question, e.g. if the individual samples are grouped according to replicates, do you need to correct for batch effects? Have you normalised your logCPM values using e.g. voom (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)? Why are you using logCPM and not another 'counts' measure? Etc – jared_mamrot Sep 08 '21 at 23:34
  • 1
    Yes @thelatemail answer worked, if you want to post it as an answer I can tick that as fixing my problem. Thanks jared_mamrot I didn't know about the bioinformatics version of stackexchange. This data is post normalisation and batch correction with edgeR – mrad Sep 08 '21 at 23:36
  • 1
    While we're offering advice, I imagine analysis of this data would be easier in a 'long' dataset with columns `gene/sample/value` - then you could easily join your metadata info on by merging/joining on `sample`, and do boxplots using `boxplot(value ~ PEDIS, data[data$gene == 'x',])` – thelatemail Sep 08 '21 at 23:38

1 Answers1

3

This question has been answered in the comments by @thelatemail, but I've added some extra details for clarity.

One approach (as suggested by @thelatemail) is to subset the matrix to only include the gene you're interested in, e.g.

matrix_1 <- structure(c(0.3, 0, 0.4, 0.4, 0.1, 0.6, 0.8, 0.8, 0.9,
                   1.3, 0.9, 0.1, 3.3, 0.9, 0.1),
                 .Dim = c(3L, 5L),
                 .Dimnames = list(c("gene_1", "gene_2", "gene_3"),
                                  c("X1", "X2", "X3", "X4", "X5")))
matrix_1
#>         X1  X2  X3  X4  X5
#> gene_1 0.3 0.4 0.8 1.3 3.3
#> gene_2 0.0 0.1 0.8 0.9 0.9
#> gene_3 0.4 0.6 0.9 0.1 0.1

df_1 <- structure(list(Sample = 1:5,
                      PEDIS = c("PEDIS1", "PEDIS1", "PEDIS2", 
                                "PEDIS2", "PEDIS2")),
                 class = c("tbl_df", "tbl", "data.frame"),
                 row.names = c(NA, -5L))
df_1
#> # A tibble: 5 × 2
#>   Sample PEDIS 
#>    <int> <chr> 
#> 1      1 PEDIS1
#> 2      2 PEDIS1
#> 3      3 PEDIS2
#> 4      4 PEDIS2
#> 5      5 PEDIS2

# Boxplot
boxplot(matrix_1[2,] ~ df_1$PEDIS)

# Or, using the gene name, rather than the row number
boxplot(matrix_1["gene_2",] ~ df_1$PEDIS)

Created on 2021-09-09 by the reprex package (v2.0.1)

--

Another approach is to combine and transpose your data into the 'tidy' format (each row is a sample, each column is a variable), e.g.

df_2 <- as.data.frame(t(matrix_1))
df_2$group <- df_1$PEDIS
df_2
#>    gene_1 gene_2 gene_3  group
#> X1    0.3    0.0    0.4 PEDIS1
#> X2    0.4    0.1    0.6 PEDIS1
#> X3    0.8    0.8    0.9 PEDIS2
#> X4    1.3    0.9    0.1 PEDIS2
#> X5    3.3    0.9    0.1 PEDIS2

boxplot(gene_2 ~ group, data = df_2)

The tidy format facilitates the use of the ggplot2 package (for details on why this is a good thing, see: https://r-graphics.org/) which enables you to add 'features' to your plot, e.g. the individual logCPM values (dots) for each sample:

library(tidyverse)

df_2 %>%
  pivot_longer(cols = -group, names_to = "genes") %>%
  filter(genes == "gene_2") %>%
  ggplot(aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = genes), width = 0.2) +
  theme_minimal()

You can also make more complicated figures, for example, comparing logCPM values for two genes 'side-by-side':

df_2 %>%
  pivot_longer(cols = -group, names_to = "genes") %>%
  filter(genes %in% c("gene_1", "gene_2")) %>%
  ggplot(aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = genes), width = 0.2) +
  theme_minimal() +
  facet_wrap(~ genes)

Created on 2021-09-09 by the reprex package (v2.0.1)

jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • this is something I was looking for..thanks a ton..let me see if I can do the same ..its like choosing a gene or any gene from the dataframe and plotting it.. – PesKchan Feb 22 '22 at 15:23
  • 1
    If you are working with edgeR gene expression data @PesKchan there are many ways of plotting/visualising changes in gene expression. See e.g. https://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html for a step-by-step tutorial (I think the Glimma `glXYPlot()` is an excellent approach for looking at specific genes across all samples) – jared_mamrot Feb 22 '22 at 22:09
  • im using deseq2 ...for downstream i try to use ggplot2 for plotting ....I will post the question let me try it myself first and will come up – PesKchan Feb 23 '22 at 06:55