I ran a quantitative proteomics experiment to measure differential protein expression in cells between two conditions. The output is a list of peptides, the protein each one maps to, and their abundance in the experimental and control conditions. Each protein has several detected peptides, and I need to pull out the median peptide abundance per protein, per condition, into a new data frame. A simplified version is shown below:

gene       peptide  condition 1 abundance  condition 2 abundance
protein 1  A        1                      4
protein 1  B        2                      5
protein 2  A        3                      6
protein 2  B        3.5                    7
protein 2  C                               5

Is there a way to write code for this in R? Note that I have about 6000 proteins, and about 60,000 detected peptides. Not all peptides were detected in both condition 1 and 2, but I would still need to take the median of all peptides per protein for each condition separately.

The goal is to do statistical analysis between the median peptide abundance for each protein so I can see if the values are significantly different.

Thanks in advance!

Sean77
  • These should help: https://stackoverflow.com/q/9723208/3358272, https://stackoverflow.com/q/21982987/3358272. – r2evans May 07 '21 at 20:37

1 Answer

Update to bonus question: to remove proteins with only one peptide use this code:

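library(dplyr)

df %>%
  group_by(gene) %>%
  # median per condition column, plus the number of distinct peptides per protein
  summarize(across(starts_with("condition"), median), count = n_distinct(peptide)) %>%
  filter(count != 1) %>%   # drop proteins represented by a single distinct peptide
  select(-count)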

I have added a protein 3 with two rows for the same peptide A to the new data frame:

df <- tibble::tribble(
        ~gene, ~peptide, ~condition.1.abundance, ~condition.2.abundance,
  "protein 1",      "A",                      1,                     4L,
  "protein 1",      "B",                      2,                     5L,
  "protein 2",      "A",                      3,                     6L,
  "protein 2",      "B",                    3.5,                     7L,
  "protein 2",      "C",                     NA,                     5L,
  "protein 3",      "A",                     NA,                     5L,
  "protein 3",      "A",                     NA,                     5L
  )

Output:

  gene      condition.1.abundance condition.2.abundance
  <chr>                     <dbl>                 <dbl>
1 protein 1                   1.5                   4.5
2 protein 2                  NA                     6  

First answer: here is a solution using the dplyr package with across() (thanks to r2evans :-)):

df %>% 
  group_by(gene) %>% 
  summarize(across(starts_with("condition"), median))

And without across():

library(dplyr)
df %>% 
  group_by(gene) %>% 
  summarize(median_condition.1.abundance = median(condition.1.abundance), 
            median_condition.2.abundance = median(condition.2.abundance))

Output:

  gene      median_condition.1.abundance median_condition.2.abundance
  <chr>                            <dbl>                        <dbl>
1 protein 1                          1.5                          4.5
2 protein 2                         NA                            6  
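
In the output above, protein 2 gets NA for condition 1 because median() returns NA as soon as one value in the group is missing. The question says that peptides missing from one condition should not prevent a per-condition median, so here is a minimal sketch, assuming missing abundances are simply to be ignored, that passes na.rm = TRUE. The toy data repeats the example above (with plain doubles), and the name `medians` is only illustrative.

```r
library(dplyr)

# Same toy data as above: peptide C of protein 2 is missing in condition 1
df <- tibble::tribble(
  ~gene,       ~peptide, ~condition.1.abundance, ~condition.2.abundance,
  "protein 1", "A",      1,                      4,
  "protein 1", "B",      2,                      5,
  "protein 2", "A",      3,                      6,
  "protein 2", "B",      3.5,                    7,
  "protein 2", "C",      NA,                     5
)

medians <- df %>%
  group_by(gene) %>%
  # the lambda forwards na.rm = TRUE to median() for every matched column
  # (assumption: a missing abundance means "not detected" and is skipped)
  summarize(across(starts_with("condition"), ~ median(.x, na.rm = TRUE)))

medians
```

With this change protein 2 gets 3.25 for condition 1 (the median of 3 and 3.5) instead of NA. A protein with no detected peptide at all in a condition would still come out as NA for that condition.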

data:

df <- structure(list(gene = c("protein 1", "protein 1", "protein 2", 
"protein 2", "protein 2"), peptide = c("A", "B", "A", "B", "C"
), condition.1.abundance = c(1, 2, 3, 3.5, NA), condition.2.abundance = c(4L, 
5L, 6L, 7L, 5L)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))
TarJae
  • This is great! I should be able to figure out how to enter my data correctly into this (though I am not very experienced). One concern is that since I have 48,264 rows (the number of peptides), I couldn't possibly enter the peptide and protein names manually as in the data section. Would there be a way to automate this to the same end? – Sean77 May 07 '21 at 21:13
  • Please see my edit. It should now remove all proteins that have only one peptide. – TarJae May 07 '21 at 21:26
  • Wonderful! I did have the one last question about protein and peptide entry into the data section (edited the first comment I made) if it is no trouble. You have helped me so much, so please let me know if there is any way I can pay back the help you have given me here. – Sean77 May 07 '21 at 21:44
  • I think I almost have it working but there is one last thing (hopefully). To import the data I created a dataframe from a CSV using the read_csv function, then select() to select the relevant columns into a new data frame with all the values I need. I can run the code without the filter steps successfully, but when I add the filter steps I get the following error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘select’ for signature ‘"tbl_df" – Sean77 May 07 '21 at 22:58
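
On the last two comments: a table of this size would normally be read from a file rather than typed in. An "unable to find an inherited method for function 'select'" error comes from S4 method dispatch, which suggests that a select() from another loaded package is masking dplyr's. That is an assumption, since the comment does not say which packages were attached. Below is a minimal sketch under that assumption. The temporary CSV and the extra `score` column are hypothetical stand-ins for the real export, the dplyr:: prefix sidesteps any masking, and na.rm = TRUE assumes missing abundances should be ignored.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in for the real export: write a tiny CSV to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "gene,peptide,score,condition.1.abundance,condition.2.abundance",
  "protein 1,A,0.9,1,4",
  "protein 1,B,0.8,2,5",
  "protein 2,A,0.7,3,6",
  "protein 2,B,0.9,3.5,7",
  "protein 2,C,0.6,,5",
  "protein 3,A,0.5,,5"
), csv_path)

# Empty fields are read as NA; cols() lets readr guess the types quietly
raw <- read_csv(csv_path, col_types = cols())

result <- raw %>%
  # keep only the columns needed; dplyr:: avoids a masked select()
  dplyr::select(gene, peptide, starts_with("condition")) %>%
  dplyr::group_by(gene) %>%
  dplyr::summarize(
    across(starts_with("condition"), ~ median(.x, na.rm = TRUE)),
    count = n_distinct(peptide)
  ) %>%
  dplyr::filter(count != 1) %>%   # drop proteins with a single distinct peptide
  dplyr::select(-count)

result
```

With the real data, `csv_path` would simply be the path to the exported file, and the column names passed to select() would be whatever that file uses.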