How to plot correlation graphs with R^2 for a big data matrix?

Question

I have a proteomics data matrix in which a different number of peptides was detected for each protein (the number of detectable peptides depends on the protein).

Q1. How can I plot correlation graphs for each protein to compare how its peptides behave? For example, for protein A I have peptides a1-a3, and I want to compare a1 vs a2, a1 vs a3, and a2 vs a3.

Sample data

structure(list(Protein = c("A", "A", "A", "A", "B", "C", "C", "D", "D", "D"), Peptide = c("a1", "a2", "a3", "a4", "b1", "c1", "c2", "d1", "d2", "d3"), Sample1 = c(0.275755732, 0.683048798, 1.244604878, 0.850270313, 0.492175199, 0.269651338, 0.393004954, 0.157966662, 1.681672581, 0.298308801), Sample2 = c(0.408992244, 0.172488244, 1.749247694, 0.358172308, 0.142129982, 0.158636283, 0.243500648, 0.095019037, 0.667928805, 0.572162278), Sample3 = c(0.112265765, 0.377174168, 2.430040623, 0.497873323, 0.141136584, 0.250330266, 0.249783164, 0.107188279, 0.173623439, 0.242298602), Sample4 = c(0.87688073, 0.841826338, 0.831376575, 0.985900966, 0.891632525, 1.016533723, 0.292048735, 0.776351689, 0.800070173, 1.161882923), Sample5 = c(1.034093889, 0.304305772, 0.616445765, 1.000820463, 1.03124071, 0.995897846, 0.289542364, 0.578721727, 0.672592766, 1.168944588), Sample6 = c(1.063124715, 0.623917522, 0.613196611, 0.990921045, 1.014340981, 0.965631141, 0.316793011, 1.02220535, 1.182063616, 1.41196421), Sample7 = c(1.335677026, 0.628621656, 0.411171453, 1.050563412, 1.290233552, 1.1603839, 0.445372411, 1.077192698, 0.726669337, 1.09453338), Sample8 = c(1.139360562, 0.404024829, 0.263714711, 0.899959209, 1.356913804, 1.246338203, 0.426568548, 1.104988267, 0.964924824, 1.083654341), Sample9 = c(1.38146599, 0.582817437, 0.783698738, 1.118948066, 1.010795866, 1.277086848, 0.434025911, 1.238871048, 1.201184368, 1.476478831), Sample10 = c(1.111486801, 0.60513273, 0.460680037, 1.385702246, 1.448873253, 1.364329784, 0.375032044, 1.382750002, 0.741842319, 1.035657705)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list( cols = list(Protein = structure(list(), class = c("collector_character", "collector")), Peptide = structure(list(), class = c("collector_character", "collector")), Sample1 = structure(list(), class = c("collector_double", "collector")), Sample2 = structure(list(), class = c("collector_double", "collector")), Sample3 = 
structure(list(), class = c("collector_double", "collector")), Sample4 = structure(list(), class = c("collector_double", "collector")), Sample5 = structure(list(), class = c("collector_double", "collector")), Sample6 = structure(list(), class = c("collector_double", "collector")), Sample7 = structure(list(), class = c("collector_double", "collector")), Sample8 = structure(list(), class = c("collector_double", "collector")), Sample9 = structure(list(), class = c("collector_double", "collector")), Sample10 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

Since the number of peptides varies for each protein, how can I compare each peptide and save the faceted graph as single plots, so that I can select only the required graphs?

What have you tried so far to resolve the problem? Perhaps you can edit your post and show us what has already been tried. – mnm, Jun 28 '18 at 12:53
I'm unable to connect your expected output with your sample data. Where in your sample data are the quantities C_low/high? What are the numbers in the facet strips? Where is depth in your sample data? Please edit your question to provide expected output **based on your sample data**. – Maurits Evers, Jun 28 '18 at 12:55
Thanks Dr. Evers. No, this plot was just added to show what I'm expecting, but I cannot do this since the number of peptides varies for each protein, so some comparisons are useless (e.g. a1 vs b1). I want to compare only the peptides of a particular protein. I will remove this graph if it is confusing. – Dendrobium, Jun 28 '18 at 13:00
Also see [here](https://stackoverflow.com/questions/10239497/ggplot2scatterplots-for-all-possible-combinations-of-variables) – Axeman, Jun 28 '18 at 13:00

Maurits Evers · Accepted Answer · 2018-06-29T12:09:26.710

2

"Hence peptide number varies for each protein, how can I compare each peptide and save the faceted graph into single plots, by this, I can select only required graphs." I'm not entirely sure what you actually want to plot. A correlation plot of which quantities? Select only which required graphs?

Anyway, perhaps the following will help.

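library(GGally)
library(tidyverse)

# df is the sample data from the question
df %>%
    # long format: one row per Protein/Peptide/Sample
    gather(Sample, Value, -Protein, -Peptide) %>%
    # wide format: one column per Peptide, one row per Protein/Sample
    spread(Peptide, Value) %>%
    filter(Protein == "A") %>%
    # columns 3 to 6 hold the peptides a1 to a4 of protein A
    ggpairs(columns = 3:6)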

Explanation: We reshape data such that we have Values for every Peptide in columns; then we filter entries for Protein == "A" and use GGally::ggpairs to show pairwise correlation plots of Values for every Peptide.
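As an alternative sketch of the same reshaping: in tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). If the protein is filtered before widening, only that protein's peptides become columns, so no column positions have to be looked up. The data frame `df_small` and the name `wide_A` below are assumptions made for illustration; `df_small` is a three-sample subset of the question's sample data.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the question's sample data (3 peptides, 3 samples)
df_small <- data.frame(
  Protein = c("A", "A", "B"),
  Peptide = c("a1", "a2", "b1"),
  Sample1 = c(0.275755732, 0.683048798, 0.492175199),
  Sample2 = c(0.408992244, 0.172488244, 0.142129982),
  Sample3 = c(0.112265765, 0.377174168, 0.141136584),
  stringsAsFactors = FALSE
)

wide_A <- df_small %>%
  filter(Protein == "A") %>%                        # keep one protein first
  pivot_longer(cols = starts_with("Sample"),        # one row per Peptide/Sample
               names_to = "Sample", values_to = "Value") %>%
  pivot_wider(names_from = Peptide, values_from = Value)  # one column per peptide

# Every column after Protein and Sample is now a peptide of protein A.
# The plot is only built if GGally is installed.
if (requireNamespace("GGally", quietly = TRUE)) {
  p <- GGally::ggpairs(wide_A, columns = 3:ncol(wide_A))
}
```

Because the filter runs first, `3:ncol(wide_A)` always covers exactly the peptides of the chosen protein, however many there are.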

You have a lot of flexibility in customising the output plot of ggpairs (e.g. add regression lines, remove panels, etc.); I recommend taking a look at the GGally GitHub project page and at Multiple regression lines in ggpairs.

Update

If you want to show correlation plots only for certain Peptides, you could do the following:

pep_of_interest <- c("a2", "a4")
df %>%
    gather(Sample, Value, -Protein, -Peptide) %>%
    spread(Peptide, Value) %>%
    filter(Protein == "A") %>%
    ggpairs(columns = match(pep_of_interest, colnames(.)))
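To illustrate why this works on a large table: match() returns the position of each requested name within the vector of column names, so peptides can be picked by name without counting columns. A minimal base R sketch, where `cols` is an assumed stand-in for colnames() of the reshaped sample data:

```r
# Assumed column names of the reshaped sample data (spread() sorts the peptide columns)
cols <- c("Protein", "Sample", "a1", "a2", "a3", "a4",
          "b1", "c1", "c2", "d1", "d2", "d3")

pep_of_interest <- c("a2", "a4")

# Positions of the requested peptides, in the order they were requested
idx <- match(pep_of_interest, cols)   # 4 and 6

# A name that is not a column yields NA, which is worth checking before plotting
missing_idx <- match("zz", cols)
```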

edited Jun 29 '18 at 12:09

answered Jun 28 '18 at 13:41

Maurits Evers


try this to get all plots per protein in a list: `res <- d %>% gather(k,v, -Protein, -Peptide) %>% split(.$Protein) %>% map(~spread(.,Peptide, v)) %>% map(~select(.,-1:-2) %>% ggpairs(.))` – Roman Jun 28 '18 at 13:48
Thanks a lot Dr. Evers! Appreciated as always :) – Dendrobium Jun 28 '18 at 23:58
As always, you're very welcome @Oncidium ;-) Good luck with your work! – Maurits Evers Jun 29 '18 at 00:05
@Dr. Evers, this package is awesome. Thanks for your answer. One quick question: my working df is a large one. After reshaping the data, what is the easiest way to find the location of my protein (peptides) of interest? Here you select columns 3:6 in ggpairs(columns = 3:6); how can I easily find the corresponding columns in my data? – Dendrobium Jun 29 '18 at 05:38
@Jimbou Thank you, this is very helpful. I have two questions, if you could help me with them. Q1. How can I add the protein name to the plots and save these plots separately in a folder? Q2. How can I gather the plots that belong to one protein into a faceted plot and save it? – Dendrobium Jun 29 '18 at 12:03
@Oncidium I've updated my answer with an example how to show correlation plots only for certain peptides of interest. Please take a look. – Maurits Evers Jun 29 '18 at 12:10

score 1 · Answer 2 · answered Jun 28 '18 at 14:29

Here is a solution using the corrplot library if you are looking for visual representation of correlation. A lot more plotting options are available in the library (take a look at the corrplot vignette).

# sample data
dd <- structure(list(Protein = c("A", "A", "A", "A", "B", "C", "C", "D", "D", "D"), Peptide = c("a1", "a2", "a3", "a4", "b1", "c1", "c2", "d1", "d2", "d3"), Sample1 = c(0.275755732, 0.683048798, 1.244604878, 0.850270313, 0.492175199, 0.269651338, 0.393004954, 0.157966662, 1.681672581, 0.298308801), Sample2 = c(0.408992244, 0.172488244, 1.749247694, 0.358172308, 0.142129982, 0.158636283, 0.243500648, 0.095019037, 0.667928805, 0.572162278), Sample3 = c(0.112265765, 0.377174168, 2.430040623, 0.497873323, 0.141136584, 0.250330266, 0.249783164, 0.107188279, 0.173623439, 0.242298602), Sample4 = c(0.87688073, 0.841826338, 0.831376575, 0.985900966, 0.891632525, 1.016533723, 0.292048735, 0.776351689, 0.800070173, 1.161882923), Sample5 = c(1.034093889, 0.304305772, 0.616445765, 1.000820463, 1.03124071, 0.995897846, 0.289542364, 0.578721727, 0.672592766, 1.168944588), Sample6 = c(1.063124715, 0.623917522, 0.613196611, 0.990921045, 1.014340981, 0.965631141, 0.316793011, 1.02220535, 1.182063616, 1.41196421), Sample7 = c(1.335677026, 0.628621656, 0.411171453, 1.050563412, 1.290233552, 1.1603839, 0.445372411, 1.077192698, 0.726669337, 1.09453338), Sample8 = c(1.139360562, 0.404024829, 0.263714711, 0.899959209, 1.356913804, 1.246338203, 0.426568548, 1.104988267, 0.964924824, 1.083654341), Sample9 = c(1.38146599, 0.582817437, 0.783698738, 1.118948066, 1.010795866, 1.277086848, 0.434025911, 1.238871048, 1.201184368, 1.476478831), Sample10 = c(1.111486801, 0.60513273, 0.460680037, 1.385702246, 1.448873253, 1.364329784, 0.375032044, 1.382750002, 0.741842319, 1.035657705)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list( cols = list(Protein = structure(list(), class = c("collector_character", "collector")), Peptide = structure(list(), class = c("collector_character", "collector")), Sample1 = structure(list(), class = c("collector_double", "collector")), Sample2 = structure(list(), class = c("collector_double", "collector")), Sample3 = 
structure(list(), class = c("collector_double", "collector")), Sample4 = structure(list(), class = c("collector_double", "collector")), Sample5 = structure(list(), class = c("collector_double", "collector")), Sample6 = structure(list(), class = c("collector_double", "collector")), Sample7 = structure(list(), class = c("collector_double", "collector")), Sample8 = structure(list(), class = c("collector_double", "collector")), Sample9 = structure(list(), class = c("collector_double", "collector")), Sample10 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

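# for Protein A, build subset of data (drop the Protein column)
tempdd <- dd[dd$Protein == "A",][,-1]
# peptide names as a character vector; [[1]] also works when dd is a tibble,
# where tempdd[,1] would return a one-column tibble instead of a vector
cc <- tempdd[[1]]
# transpose so that samples are rows and peptides are columns
tempdd <- t(tempdd[,-1])
colnames(tempdd) <- cc

# calculate the pairwise correlations between peptides across all samples
rr <- cor(tempdd)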

# install.packages("corrplot")
library(corrplot)

#Build the plot
corrplot(rr,method='circle')
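The question title asks for R^2 rather than r. For a pair of variables, the R^2 of a simple linear regression equals the squared Pearson correlation, so squaring the correlation matrix gives it directly. Below is a minimal base R sketch that repeats the steps above for every protein with at least two peptides. `dd_small` is a shortened copy of the sample data (five peptides, four samples), and the names `sample_cols` and `r2_by_protein` are made up for illustration.

```r
# Shortened copy of the question's sample data
dd_small <- data.frame(
  Protein = c("A", "A", "B", "C", "C"),
  Peptide = c("a1", "a2", "b1", "c1", "c2"),
  Sample1 = c(0.275755732, 0.683048798, 0.492175199, 0.269651338, 0.393004954),
  Sample2 = c(0.408992244, 0.172488244, 0.142129982, 0.158636283, 0.243500648),
  Sample3 = c(0.112265765, 0.377174168, 0.141136584, 0.250330266, 0.249783164),
  Sample4 = c(0.87688073, 0.841826338, 0.891632525, 1.016533723, 0.292048735),
  stringsAsFactors = FALSE
)

sample_cols <- grep("^Sample", names(dd_small), value = TRUE)

# One R^2 matrix per protein; proteins with a single peptide are skipped
r2_by_protein <- lapply(split(dd_small, dd_small$Protein), function(p) {
  if (nrow(p) < 2) return(NULL)          # nothing to compare
  m <- t(as.matrix(p[, sample_cols]))    # samples in rows, peptides in columns
  colnames(m) <- p$Peptide
  cor(m)^2                               # squared Pearson correlation
})
r2_by_protein <- Filter(Negate(is.null), r2_by_protein)
```

Each element of `r2_by_protein` is a square matrix named by peptide and can be passed to corrplot() in the same way as `rr`.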
