I have data that looks like this:

Sample  Replication Days
1   1   10
1   1   14
1   1   13
2   1   NA
2   1   5
2   1   18
1   2   16
1   2   NA
1   2   18
2   2   15
2   2   7
2   2   12

I want to add a column with the average value of each Sample within each replication. I want to keep replication as a factor so I can see whether replication has any effect. For example, I want one average for Sample 1 in replication 1 and a separate average for Sample 1 in replication 2. Then I want to use that column for ANOVA using:

sample_aov <- aov(Sample~Days, na.rm=TRUE)

I tried using aggregate, but I think I am making a mistake. I would appreciate any help. Thanks!

Jessica
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data, not a picture of it, which we can't run code on – camille Dec 09 '19 at 16:23
  • Thank you, I have updated the question, @camille! – Jessica Dec 09 '19 at 16:29
  • So what's wrong with your `aov` results? And if you tried calling `aggregate` but made a mistake, what is it? Right now it's unclear what the question is, but there are a lot of posts already on calculating summary stats like mean by group – camille Dec 09 '19 at 16:31

2 Answers

Using the tidyverse, you can process your data frame like this:

library(tidyverse)
df = data.frame(Sample = c(rep(1,3), rep(2,3),rep(1,3), rep(2,3)),
                Replication = c(rep(1,6), rep(2,6)),
                Days = c(10,14,13,NA,5,18,16,NA,18,15,7,12))

df <- df %>% group_by(Sample, Replication) %>% summarise(Mean = mean(Days, na.rm = TRUE))

And you get the following dataframe:

> df
# A tibble: 4 x 3
# Groups:   Sample [2]
  Sample Replication  Mean
   <dbl>       <dbl> <dbl>
1      1           1  12.3
2      1           2  17  
3      2           1  11.5
4      2           2  11.3

Now you can run an ANOVA on this data frame:

> aov(Mean ~ Sample, data = df)
Call:
   aov(formula = Mean ~ Sample, data = df)

Terms:
                  Sample Residuals
Sum of Squares  10.56250  10.90278
Deg. of Freedom        1         2

Residual standard error: 2.334821
Estimated effects may be unbalanced

As you only have two groups to compare, a t-test is more appropriate:

> t.test(Mean ~ Sample, data = df)

    Welch Two Sample t-test

data:  Mean by Sample
t = 1.392, df = 1.0026, p-value = 0.3962
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -26.23892  32.73892
sample estimates:
mean in group 1 mean in group 2 
       14.66667        11.41667 

Is this what you are looking for?
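Since the question asks to keep replication as a factor, here is a minimal sketch of one way to do that: fit the ANOVA on the raw values with both Sample and Replication as factors, instead of on the four group means. The names `df_raw` and `fit` are made up for this illustration, and the additive model (no interaction term) is an assumption, not something stated in the question.

```r
# Rebuild the example data from the question (raw values, not the means)
df_raw <- data.frame(
  Sample      = rep(rep(c(1, 2), each = 3), times = 2),
  Replication = rep(c(1, 2), each = 6),
  Days        = c(10, 14, 13, NA, 5, 18, 16, NA, 18, 15, 7, 12)
)

# Treat both grouping variables as factors rather than numbers
df_raw$Sample      <- factor(df_raw$Sample)
df_raw$Replication <- factor(df_raw$Replication)

# aov() has no na.rm argument; with the default na.action (na.omit)
# the rows where Days is NA are dropped before fitting
fit <- aov(Days ~ Sample + Replication, data = df_raw)
summary(fit)
```

The Replication row of the summary table is where you would look for a replication effect under this model.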

dc37
  • Thank you, @dc37! Yes, this is similar to what I am looking for. But I have a question: I have shown only a few lines of my dataset. I have 100 samples over two replications. To make a data frame, do I need to enter all values manually as you did for df? – Jessica Dec 09 '19 at 18:11
  • Thank you, So now I have the data imported but after running df <- df %>% group_by(Sample, Replication) %>% summarise(Mean = mean(Days, na.rm = TRUE)) %>% mutate(AOV = ). I get an error: Error: object '' not found – Jessica Dec 09 '19 at 19:00
  • So sorry, I forgot to remove the last part of my code (`mutate(AOV = )`) when editing my answer. Try without this part, it should work now. – dc37 Dec 09 '19 at 19:04
  • It ran now but gave all 'NA' for the mean column. – Jessica Dec 09 '19 at 19:07
  • Is your `Days` column in a numeric format? (Check it with `str(df)`.) – dc37 Dec 09 '19 at 19:08
  • You are right, Days was a factor earlier. So it made a column for the mean. But I compared it with the manually calculated numbers and these are different. Basically, there are 134 samples with 4 values for each but df says # A tibble: 135 x 3 – Jessica Dec 09 '19 at 19:27
  • Yes, I think you're right: when you convert a factor directly, it uses the order of the levels as the numeric values. I get this kind of error quite frequently :(. Maybe you can first convert it to a vector and then to numeric: `df$Days = as.vector(df$Days)` and then `df$Days = as.numeric(df$Days)` – dc37 Dec 09 '19 at 20:08
  • Thank you! After ANOVA, I calculated, `lsmeans(anova, ~Sample)`. Then I created means and SE, `Design<-lsmeans(anova, ~Sample)` `Design<-summary(anova)` Then I plotted it, `p<- ggplot(Design,aes(x=Sample,y=lsmean))+ geom_bar(stat = "identity", position=position_dodge(), colour="black", fill="grey", width = 1) + geom_errorbar(aes(x=Line, ymin = lsmean-SE, ymax = lsmean+SE), width = 0.25, color = "black") + (limits = c(0, 40)) But it says data should be dataframe. I want to plot it using confidence interval as error bars. Can you help. – Jessica Dec 10 '19 at 00:19
  • Glad that it works and you were able to continue your analysis ! Good luck – dc37 Dec 10 '19 at 00:22
  • Yes, But I had an issue in ggplot that I mentioned in the previous comment. – Jessica Dec 10 '19 at 00:55
  • The output of `summary(anova)` is a list, so you can't plot it as it is with `ggplot`. You should ask a new question for that, because we are drifting away from your original question and it needs much more detailed information than can be contained in comments – dc37 Dec 10 '19 at 01:01
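The factor-to-numeric pitfall discussed in the comments above can be sketched as follows. The vector `days_factor` is a made-up stand-in for a `Days` column that was read in as a factor; the values are illustrative only.

```r
# A hypothetical Days column that was imported as a factor
days_factor <- factor(c("10", "14", "5", "18"))

# Converting the factor directly returns the internal level codes,
# not the numbers printed on screen
codes <- as.numeric(days_factor)

# Converting to character first recovers the actual values
values <- as.numeric(as.character(days_factor))

codes
values
```

Going through `as.character()` (or `as.vector()`, as suggested in the comments) before `as.numeric()` is what avoids the level codes. Any entry that is not a valid number becomes `NA` with a coercion warning.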

Let us follow your original idea and use aggregate. We will call your data.frame df. Given that your grouping variables are sample and repl (the column names used in this answer), use:

> val <- aggregate(.~sample+repl, df, FUN=mean)
> val
  sample   repl          days
       1      1      12.33333
       2      1      11.50000
       1      2      17.00000
       2      2      11.33333

You are now ready to perform your ANOVA.
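A small self-contained sketch of why the call above needs no `na.rm`: the formula interface of `aggregate` drops rows containing `NA` by default (`na.action = na.omit`), whereas the `by =` interface does not, so there `na.rm = TRUE` has to be passed through to `mean`. The lowercase column names follow this answer; `val_formula` and `val_by` are names invented for the illustration.

```r
# Example data from the question, with the column names used in this answer
df <- data.frame(
  sample = rep(rep(c(1, 2), each = 3), times = 2),
  repl   = rep(c(1, 2), each = 6),
  days   = c(10, 14, 13, NA, 5, 18, 16, NA, 18, 15, 7, 12)
)

# Formula interface: rows with NA are removed before mean() is applied
val_formula <- aggregate(. ~ sample + repl, df, FUN = mean)

# by= interface: NAs are kept, so na.rm = TRUE is forwarded to mean()
val_by <- aggregate(df$days,
                    by = list(sample = df$sample, repl = df$repl),
                    FUN = mean, na.rm = TRUE)

val_formula
val_by  # same group means; the value column is named x here
```

Note that with `. ~`, a row is dropped if any column in it is `NA`, which matters once the data frame has more measurement columns than this example.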

Vidhya G