I have looked at many similar questions (like this one), but in my case the treatment groups are not saved as separate vectors, and I haven't had any success substituting my variable names into any other code I've seen on this topic.

I want to compare means for "before" and "after" treatments for the same variable (test score) across multiple locations.

My data looks like this:

  > head(my.df, n=15)
     Location TestScore Treatment
  1         4 0.7167641    Before
  2         4 0.7998261    Before
  3         4 0.8165880     After
  4         4 0.8078955     After
  5         7 0.6993413    Before
  6         7 0.8404255    Before
  7         7 0.7803164    Before
  8         7 0.8383867     After
  9         7 0.7930419     After
  10        8 0.8504963    Before
  11        8 0.7734653    Before
  12        8 0.8408432     After
  13        8 0.7980454     After
  14        8 0.8407756     After
  15        8 0.7837427     After

Note that the number of "before" and "after" responses is different both within and between locations.

I know I can use the following code to compare the before and after treatments across ALL locations:

t.test(TestScore ~ Treatment, data = my.df, var.equal = FALSE)

However, I want to compare the before and after values for EACH location (since I have 100+ locations), not ALL locations at once. Ideally I could generate a list or table of p-values without having to write a new line of code each time. I thought I could do something simple like adding "group_by" like I've shown below:

my.df %>% group_by(Location) %>% do(tidy(t.test(TestScore ~ Treatment, data = my.df, var.equal = FALSE)))

but when I run this code I get an output with the same p-value for every location (even though the data are different), as shown below:

# A tibble: 10 x 11
# Groups:   Location [10]
   Location estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high method                  alternative
   <fct>         <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>                   <chr>      
 1 4            0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 2 7            0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 3 8            0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 4 9            0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 5 10           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 6 12           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 7 14           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 8 16           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
 9 21           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided  
10 27           0.0587     0.972     0.913      15.0 1.60e-20      51.8   0.0508    0.0665 Welch Two Sample t-test two.sided 

How can I get separate p-values comparing the before and after treatments for each location? Any help is greatly appreciated!

AJK

1 Answer

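You got most of the code correct. After group_by, to work with the data inside each group, you need to use data = . instead of data = my.df. With data = my.df, the test is run on the full data frame for every group, which is why all the p-values came out identical: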

my.df %>% group_by(Location) %>% 
do(tidy(t.test(TestScore ~ Treatment, data = ., var.equal = FALSE)))

For example:

library(dplyr)
library(broom)

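# Simulated data: 100 random scores spread over three locations
my.df = data.frame(Location=sample(c(4,7,8),100,replace=TRUE),
TestScore=rnorm(100,10,1),
Treatment=sample(c("Before","After"),100,replace=TRUE))

# One Welch t-test per location; the numbers will vary because the data are random
my.df %>% group_by(Location) %>%
do(tidy(t.test(TestScore ~ Treatment, data = ., var.equal = FALSE)))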

# A tibble: 3 x 11
# Groups:   Location [3]
  Location estimate estimate1 estimate2 statistic p.value parameter conf.low
     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1        4   0.660       10.0      9.38     1.74   0.0926      31.0   -0.116
2        7   0.191       10.2     10.0      0.620  0.541       24.7   -0.445
3        8  -0.0720      10.1     10.2     -0.198  0.844       32.0   -0.813
StupidWolf
  • Thanks! Just for clarification, can you explain what the "data = ." is doing in this situation? Still fairly new to R. – AJK Apr 13 '20 at 22:06
  • The . is the magrittr dot, https://magrittr.tidyverse.org/reference/pipe.html. Conceptually, do(tidy(t.test(..., data = .))) behaves like applying function(x) tidy(t.test(..., data = x)) to each group: inside do(), the . stands for the current group's data, so you don't have to write function(x) yourself. – StupidWolf Apr 13 '20 at 22:26
  • You can also check this out: https://stackoverflow.com/questions/35272457/what-does-the-dplyr-period-character-reference . Try writing the do() part using a normal function, as in lapply, and you will see how it works. – StupidWolf Apr 13 '20 at 22:27
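
To make the suggestion in the last comment concrete, here is a minimal base R sketch of the same per-location test written with an explicit function instead of the dot. The simulated data, the seed, and the names by_location and p_values are assumptions for illustration and are not part of the original post.

```r
# Simulated stand-in for my.df (assumed data, not the asker's)
set.seed(1)
my.df <- data.frame(
  Location  = rep(c(4, 7, 8), each = 20),
  TestScore = rnorm(60, mean = 10, sd = 1),
  Treatment = rep(c("Before", "After"), times = 30)
)

# Split the data frame into one piece per location
by_location <- split(my.df, my.df$Location)

# Run the Welch t-test on each piece; `x` plays the role of the dot in do()
p_values <- sapply(by_location, function(x) {
  t.test(TestScore ~ Treatment, data = x, var.equal = FALSE)$p.value
})

# Named numeric vector: one p-value per location
p_values
```

Here x is one location's rows at a time, which is what data = . refers to inside do(). Passing data = my.df inside the function instead would reproduce the identical p-values from the question.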