R dplyr how to summarize

Question

Essentially, I am trying to make a series of plots with log2 fold-change on the y-axis and mean counts on the y-axis (the observation is genes). These are commonly called MA plots. The issue I am having is getting my data into the right form. I can do this through a loop, but would like to know the right way to do it.

At this point, I have two data frames: my design matrix and my data matrix. The design matrix looks like so (call it ED_df):

SampleID     Patient Grade Batch
MD48L_2_B_L1    MD48    G2 Feb15
MD48R_3_B_L1    MD48    G3 Feb15
MD53L_2_B_L1    MD53    G2 Feb15
MD53R_3_B_L1    MD53    G3 Feb15
MD58L_2_B_L1    MD58    G2 Sep15
MD58R_3_B_L1    MD58    G3 Sep15

dim(ED_df)
# [1] 18 6

Each row is a unique sample. Each sample comes from patient+grade+batch. In this case, all patients are paired around grade (G2 or G3). There are 3 total batches. Two patients were replicated across either batch 1 and 2 or batch 2 and 3.

My data matrix looks like so (call it data_df):

      Gene          MD48L_2_B_L1 MD48R_3_B_L1 MD53L_2_B_L1 MD53R_3_B_L1 MD58L_2_B_L1
1 ENSG00000000003    364.26079    329.28730    531.52188    371.67413   275.745038
2 ENSG00000000005     18.92264     49.89201     42.18428     19.42548     1.948728
3 ENSG00000000419    270.59373    261.65590    284.74386    414.41018   293.283591
4 ENSG00000000457    145.70432    125.28439    122.33440    129.50318   148.103342

dim(data_df)
# [1] 31707 18

Each column corresponds to a unique sample.

What I am wanting to do is to get, for each gene, a log2 fold-change (G3/G2) within each patient-batch set. Additionally, I want to get mean (G3, G2) for each patient-batch set.

I will then plot this as an MA plot.

Again, I can see how to do this painfully through a nested for-loop, what I would like to do is figure out how to do this via some sort of aggregating function.

Tidy that data! Use `tidyr::gather` to turn your data matrix into a data frame with columns `Gene`, `SampleID`, and `Value` columns, then join to `ED_df` to get the Grade, Batch, and Patient columns. — Gregor Thomas, Oct 06 '15 at 00:34
Thanks. I think what you are saying is get it in long format with the factor information as additional columns, correct? In this format, gene would be a factor as well I suppose. At that point, you would do a summaryBy Value~Gene+Patient+Batch+Grade with FUN = sum or / ? — Bob Settlage, Oct 06 '15 at 00:52

bramtayl · Answer 1 · 2015-10-06T01:52:54.473

1

Two more steps: spread grade so G2 and G3 end up in different columns, then summarize. I'm not sure if I exactly understood the aggregation process you wanted, but I've taken a stab. I included the psych package for the gm (geometric mean) function. This is important when working with ratio data.

library(dplyr)
library(tidyr)
library(psych)

data_df %>%
  as.data.frame %>%
  gather(sample, measurement, -gene) %>%
  left_join(ED_df) %>%
  spread(Grade, measurement) %>%
  group_by(Patient, Batch) %>%
  summarize(G2_geometric_mean = G2 %>% gm,
            G3_geometric_mean = G3 %>% gm) %>%
  mutate(geometric_mean_ratio = G3_geometric_mean / G2_geometric_mean)

edited Oct 06 '15 at 01:52

answered Oct 06 '15 at 01:47

bramtayl

4,004
2
11
18

Hi, thanks. I think this is what I was looking for. It will take me a bit to digest it. Here is what I ended up doing in my un-R-ish way: tidy_df<-gather(tidy_df,gene) colnames(tidy_df)<-c("gene","SampleName","value") tidy_df<-merge(tidy_df,slim_expt_1[,c(1:3,5)],by="SampleName",all.x=T) temp_mean<-aggregate(value~ gene + Patient + Batch, tidy_df, FUN=mean) temp<-tidy_df[order(tidy_df$gene,tidy_df$Patient,tidy_df$Grade,tidy_df$Batch),] temp_log2FC<-log2(temp[temp$Grade=="G3",3]/temp[temp$Grade=="G2",3]) temp_mean$log2FC<-temp_log2FC – Bob Settlage Oct 06 '15 at 02:04
I can't seem to add a return in comment mode. I hate my solution b/c it assumes things are ordered and pair up nicely etc, so I will give your solution a go after I study for my inference final. =/ – Bob Settlage Oct 06 '15 at 02:06

R dplyr how to summarize

1 Answers1

Linked