1

This is my sample of a big data matrix & and each column has named with multiple information and separated by an underscore.

structure(list(Gene = c("AGI4120.1_UBQ", "AGI570.1_Acin"), WT_Tissue_0T_1 = c(0.886461437, 1.093164915), WT_Tissue_0T_2 = c(1.075140682, 1.229862834), WT_Tissue_0T_3 = c(0.632903012, 1.094003128), WT_Tissue_1T_1 = c(0.883151274, 1.26322126), WT_Tissue_1T_2 = c(1.005627276, 0.962729188), WT_Tissue_1T_3 = c(0.87123469, 0.968078993), WT_Tissue_3T_1 = c(0.723601456, 0.633890322), WT_Tissue_3T_2 = c(0.392585237, 0.534819363), WT_Tissue_3T_3 = c(0.640185369, 1.021934772), WT_Tissue_5T_1 = c(0.720291294, 0.589244505), WT_Tissue_5T_2 = c(0.362131744, 0.475251717), WT_Tissue_5T_3 = c(0.549486925, 0.618177919), mut1_Tissue_0T_1 = c(1.464415756, 1.130533457), mut1_Tissue_0T_2 = c(1.01489573, 1.114915728), mut1_Tissue_0T_3 = c(1.171797418, 1.399956009), mut1_Tissue_1T_1 = c(0.927507448, 1.231911575), mut1_Tissue_1T_2 = c(1.089705396, 1.256782289 ), mut1_Tissue_1T_3 = c(0.993048659, 0.999044465), mut1_Tissue_3T_1 = c(1.000993049, 1.103486794), mut1_Tissue_3T_2 = c(1.062562066, 0.883617224 ), mut1_Tissue_3T_3 = c(1.037404833, 0.851875438), mut1_Tissue_5T_1 = c(0.730883813, 0.437440083), mut1_Tissue_5T_2 = c(0.480635551, 0.298762126 ), mut1_Tissue_5T_3 = c(0.85468388, 0.614923997)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list( cols = list(Gene = structure(list(), class = c("collector_character", "collector")), WT_Tissue_0T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_0T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_0T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_1T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_3T_3 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_1 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_2 = structure(list(), class = c("collector_double", "collector")), WT_Tissue_5T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_0T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_1T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_3T_3 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_1 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_2 = structure(list(), class = c("collector_double", "collector")), mut1_Tissue_5T_3 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

I want to follow a Tukey Test and plot bar charts for each Gene (Response vs. Time; filled by both the genotypes) with multiple comparisons letters.

Syntax

df1 <- df %>% 
         gather(var, response, WT_Tissue_0T_1:mut1_Tissue_5T_3) %>% 
         separate(var, c("Genotype", "Tissue", "Time"), sep = "_") %>% 
         arrange(desc(Gene))
df2 <- df1 %>% 
         group_by(`Gene`,Genotype,Tissue,Time) %>% 
         mutate(Response=mean(response),n=n(),se=sd(response)/sqrt(n))

Two-way ANOVA

fit1 <- aov(Response ~ Genotype*Time, df2) 

Hereafter, I want to go proceed Tukey Test (Multiple Comparison), for e.g. Gene "AGI4120.1_UBQ", plot Response vs. Time, and see how each genotype (WT & mut1) behave in each Time point (0T, 1T, 3T, and 5T)? if the response is significantly different or not and denote by letters in the plot.

As below lsmeans syntax combine all the Genes to one and give output, how can I make it to loop for each gene separately (i.e. "AGI4120.1_UBQ", "AGI570.1_Acin") and get letters to show statistically distinct groups (aka "compact letter display")

 lsmeans(fit1, pairwise ~ Genotype | Time) 

My final aim is to plot each gene like this below graph and denote significance letters.

df2$genotype <- factor(df2$genotype, levels = c("WT","mut1")) colours <- c('#336600','#ffcc00')library(ggplot2)ggplot(df2,aes( x=Time, y=Response, fill=Genotype))+geom_bar(stat='identity', position='dodge')+scale_fill_manual(values=colours)+geom_errorbar(aes(ymin=average_measure-se, ymax=average_measure+se)+facet_wrap(~`Gene`)+labs(x='Time', y='Response')

Expecting Graph for denoting compact letter display

I would appreciate your kind help, if possible.

Dendrobium
  • 323
  • 5
  • 13
  • fit1 <- aov(Response ~ Genotype*Time, df2). NOTE: ANOVA should be done with raw data df1 because df2 is averaged replicate data which 've used to plot the graphs. – Dendrobium May 31 '18 at 11:39

1 Answers1

1

There are a number of issues with your code. I'd go so far as to say it's not really an appropriate post for StackOverflow, as the problems here are diverse and not really extendable beyond this specific constellation of bugs and syntax issues. But I'll post a few suggestions as an answer to get you started - hope they help.

1. lsmeans():
This function expects a fitted model (like from lm()), or a ref.grid object. But you are passing it a data frame, without any of the regression properties it needs to compute least-squares means. (Think: How does lsmeans() know what the dependent variable should be, when you ask for pairwise comparison with Genes as the independent variable?) Check out the Using lsmeans documentation for more details.

Based on your data, I'd say you probably want to run a multi-level regression, using something like the lme4 package, with Gene and Genotype and maybe Time as nested grouping levels.

But for demonstration I'll keep it simple with lm(). Passing a fitted regression object to lsmeans() works as intended:

fit <- lm(Response ~ Gene + Genotype + Time, data=df2)

lsmeans(fit, pairwise ~ Gene)[[2]] 

# Output
contrast                        estimate        SE df t.ratio p.value
AGI4120.1_UBQ - AGI570.1_Acin -0.0515123 0.0299492 42   -1.72  0.0928

Results are averaged over the levels of: Genotype, Time 

2. ggplot():
You haven't defined colours or average_measure in the code you've provided; calling these undeclared variables will cause failure.

Structurally, I would suggest you use df1 and allow ggplot to do the grouping, rather than group_by into df2. Then use stat="summary" and fun.y="mean" to accomplish the summarise() computations you did in df2. Doing this also allows you to use the mean_se function for your error bars. Like this:

ggplot(df1,aes( x=Time, y=response, fill=Genotype))+ 
  geom_bar(stat='summary', fun.y='mean', position=position_dodge(0.9))+
    stat_summary(fun.data = mean_se, geom = "errorbar", 
                 color="gray60", width=.1, position=position_dodge(0.9)) +
  scale_fill_manual(values=c("steelblue","orange"))+
  facet_wrap(~`Gene`)+ 
  labs(x='Time', y='Response')

facet plot

Finally, note that your use of separate() in df1 will throw a warning, although not an error, as there is an extra value created by splitting on sep="_". You can avoid that (if it is causing confusion) by adding a level to capture the final value (which appears to be a time index):

separate(var, c("Genotype", "Tissue", "Time", "Index"), sep = "_")
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
  • 1
    Thanks a heap Andrew. Greatly appreciate your help. Sorry I've forgotten to add my color does in the questions. I wrote them separately. i.e. df2$genotype <- factor(df2$genotype, levels = c("WT","mut1")) colours <- c('#336600','#ffcc00') May I ask, how can I follow the Tukey Multiple comparison test and add significant letters to bar chart. Working df is a huge matrix. Thank you for your time, – Dendrobium May 30 '18 at 03:28
  • What do you mean by "significant letters"? Can you show an example of the kind of graph you want to make? – andrew_reece May 30 '18 at 05:19
  • I want to add letters for the bar charts, I see this is possible with this package, Multiple comparision letters, http://rcompanion.org/handbook/G_06.html another example https://stackoverflow.com/questions/18771516/is-there-a-function-to-add-aov-post-hoc-testing-results-to-ggplot2-boxplot Not sure How to use this. – Dendrobium May 30 '18 at 05:39
  • Ok, I understand now. The post you linked to here has a lot of code that you aren't using here, though, and the package you linked to has pretty extensive documentation. I think you will need to spend some time getting to understand the `multcomp` and `multcompView` packages first, and getting more familiar with how the syntax works and integrates with plotting functions. One thing that makes your case more difficult is that you have an extra grouping dimension in your data, compared to the example in the SO post you linked to. Good luck! – andrew_reece May 30 '18 at 05:56
  • 1
    @Oncidium FYI, the **lsmeans** package (now **emmeans**) can provide letters to show statistically distinct groups (aka "compact letter display") using the `cld()` function. See this used in [the vignette](https://cran.r-project.org/web/packages/emmeans/vignettes/comparisons.html#pairwise). – aosmith May 30 '18 at 15:18
  • 1
    Hi Adrew, I've edited the question. Sorry for misleading you. Your given lsmeans(fit, pairwise ~ Gene) output is not I'm expected. I wanna look Genotype responses vs time. Also I'm really need of the help with multcomp and multcompView. I can not handle by my self. Thanks again.! – Dendrobium May 31 '18 at 01:58