
I have data distributed between two dataframes. I need to subset the data from one of the dataframes first, and then conduct a t-test between each subset and the (entire) data from the other dataframe.

I attempted to use %>% and group_by() to select the data I want, and then I tried to invoke the t-test as shown below.

library(dplyr)
a <- c("AA","AA","AA","AB","AB","AB")
b <- c(1,2,3,1,2,3)
c <- c(12,34,56,78,90,12)
cols1 <- c("SampID", "Reps", "Vals")
df1 <- data.frame(a,b,c)
colnames(df1) <- cols1
df1

  SampID Reps Vals
1     AA    1   12
2     AA    2   34
3     AA    3   56
4     AB    1   78
5     AB    2   90
6     AB    3   12

e <- c(1,2,3,4,5,6,7,8,9)
f <- c(11,22,33,44,55,66,77,88,99)
cols2 <- c("CtrlReps","CtrlVals")
df2 <- data.frame(e,f)
colnames(df2) <- cols2
df2

  CtrlReps CtrlVals
1        1       11
2        2       22
3        3       33
4        4       44
5        5       55
6        6       66
7        7       77
8        8       88
9        9       99

df1 %>%
  group_by(SampID) %>%
  t.test(Vals, df2$CtrlVals, var.equal = FALSE)

This, however, returns an error:

Error in match.arg(alternative) : 
  'arg' must be NULL or a character vector

I also tried using do but that returns an error as well:

outputs <- df1 %>%
  group_by(SampID) %>%
  do(tpvals = t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE)) %>%
  summarise(SampID, pvals = tpvals$p.value)

Error in t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE) : 
  object 'Vals' not found

I am new to R and have exhausted my Google-Fu, so I have no idea what is happening. As far as I can tell, these two errors are unrelated, but resolving either one would give me a way out of the situation. I just don't know how. I am also sure that resolving this problem would immediately land me in the next problem (the one this post actually addresses).

Your inputs/guidance/help would be much appreciated!

Dunois

1 Answer


Your attempt with do was close; it can be fixed like this:

outputs <- df1 %>%
    group_by(SampID) %>%
    do(tpvals = t.test(.$Vals, df2$CtrlVals, 
                       paired = FALSE, var.equal = FALSE)) %>%
    summarise(SampID, pvals = tpvals$p.value)

You need .$Vals to get at the Vals column within do; it doesn't work quite the same way as mutate. The data argument for t.test also isn't useful here: your two variables are not in the same dataframe, so you can't put them both in a formula.

Result:

> outputs
# A tibble: 2 x 2
  SampID pvals
  <fct>  <dbl>
1 AA     0.253
2 AB     0.862
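
As a possible alternative to do, the same comparison can probably be written with summarise alone, since column names such as Vals are looked up within each group there. This is only a sketch: it rebuilds df1 and df2 from the question with the column names set directly, and the name outputs2 is made up for illustration.

```r
library(dplyr)

# Rebuild the question's data (same values, column names set directly)
df1 <- data.frame(
  SampID = c("AA", "AA", "AA", "AB", "AB", "AB"),
  Reps   = c(1, 2, 3, 1, 2, 3),
  Vals   = c(12, 34, 56, 78, 90, 12)
)
df2 <- data.frame(CtrlReps = 1:9, CtrlVals = seq(11, 99, by = 11))

# Inside summarise(), Vals refers to the current group's values, so no
# .$ prefix is needed; df2$CtrlVals is the full control column each time.
# outputs2 is a hypothetical name, chosen to avoid clobbering outputs.
outputs2 <- df1 %>%
  group_by(SampID) %>%
  summarise(pvals = t.test(Vals, df2$CtrlVals, var.equal = FALSE)$p.value)

outputs2
```

This should give the same per-group p-values as the do version, because the call to t.test receives the same two vectors for each group.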
Marius
  • Hi @Marius, thank you so much for your help! I tried this as you suggested, and it worked flawlessly. In the interest of learning, may I ask why this code snippet works, even though the variable in it is not called with the `$`? `df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25)) df <- tbl_df(df) res <- df %>% group_by(id1) %>% do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>% summarise(id1, Wilcox = w$p.value)` Edit: struggling with the editing, but I hope it is readable. Took this from [here](https://stackoverflow.com/a/34581586/9494044). – Dunois Nov 02 '18 at 00:25
  • That snippet works because you pass `data = .`, so the `wilcox.test` function can then look up the variables in the formula within the appropriate dataframe. – Marius Nov 02 '18 at 00:33
  • So the do snippet from my code would also work if it had been set up the same way (minus the variable being called from the other dataframe)? – Dunois Nov 02 '18 at 00:38
  • Yes, if you call `t.test()` with `data = .`, you should be able to just put your variable names in a formula like `Vals ~ Group`, without having to do `.$Vals`. – Marius Nov 02 '18 at 00:41
  • I modified that code snippet with a `t.test` but it returns an `object not found` error. `df <- data.frame(x=abs(rnorm(10)),y=abs(rnorm(50)),id1=rep(1:5,10),id2=rep(1:2,25)) df <- tbl_df(df) res <- df %>% group_by(id1) %>% do(w = t.test(x, y, data=., paired=FALSE)) %>% summarise(id1, Wilcox = w$p.value)` – Dunois Nov 02 '18 at 00:50
  • I think the data argument only works when you are using a formula, like `t.test(x ~ group, ...)`. It doesn't work when you pass separate `x` and `y` vectors. – Marius Nov 02 '18 at 00:52
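
To illustrate the point about data only being used with a formula: one way to use the formula interface for the original problem is to first stack each group's values and the control values into a single long dataframe. The sketch below assumes the question's data; the names long, Value, Group and pvals are hypothetical and only there for illustration.

```r
# Rebuild the question's data (same values, column names set directly)
df1 <- data.frame(
  SampID = c("AA", "AA", "AA", "AB", "AB", "AB"),
  Reps   = c(1, 2, 3, 1, 2, 3),
  Vals   = c(12, 34, 56, 78, 90, 12)
)
df2 <- data.frame(CtrlReps = 1:9, CtrlVals = seq(11, 99, by = 11))

# For each SampID, stack that group's values on top of the control values
# so both live in one dataframe, then use the formula interface with data =.
pvals <- sapply(split(df1, df1$SampID), function(grp) {
  long <- data.frame(
    Value = c(grp$Vals, df2$CtrlVals),
    Group = rep(c("sample", "control"), times = c(nrow(grp), nrow(df2)))
  )
  t.test(Value ~ Group, data = long, var.equal = FALSE)$p.value
})

pvals
```

Here pvals is a numeric vector named by SampID. With both variables in one dataframe the formula can be resolved through data, which is the setup the wilcox.test snippet above relies on.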