I know how I can do all that for individual variables but I need to report this information for a large number of variables and would like to know if there is an efficient way to do this.
-
1Is the output of `summary` on an appropriate `lm` or `glm` model sufficient? – James Feb 21 '12 at 13:56
-
not really. What I'm looking for is a function that gives me a table like that: Variabe / Mean(SD) for group 1 / Mean(SD for group 2/ p-value for mean group difference – Mike Feb 21 '12 at 13:58
-
Can you expand your question with example data and output? – James Feb 21 '12 at 14:03
4 Answers
The tables
package makes everything in this except the p-values easy, and the p-values are doable. Here is a quick example:
> library(tables)
> iris2 <- iris[ iris$Species != 'versicolor', ]
> iris2$Species <- factor(iris2$Species)
> tmp <- tabular( Petal.Width+Petal.Length + Sepal.Width+Sepal.Length ~ Species* (mean+sd), data=iris2 )
>
> tmp.p <- sapply( names(iris2)[1:4], function(x) t.test( iris2[[x]] ~ iris2$Species )$p.value )
>
> tmp
setosa virginica
mean sd mean sd
Petal.Width 0.246 0.1054 2.026 0.2747
Petal.Length 1.462 0.1737 5.552 0.5519
Sepal.Width 3.428 0.3791 2.974 0.3225
Sepal.Length 5.006 0.3525 6.588 0.6359
> tmp2 <- cbind(tmp, tmp.p)
> colnames(tmp2) <- c('Setosa Mean','Setosa SD', 'Virginica Mean','Virginica SD',
+ 'P-value')
> tmp2
Setosa Mean Setosa SD Virginica Mean Virginica SD P-value
Sepal.Length 0.246 0.1053856 2.026 0.2746501 3.966867e-25
Sepal.Width 1.462 0.173664 5.552 0.5518947 4.570771e-09
Petal.Length 3.428 0.3790644 2.974 0.3224966 9.269628e-50
Petal.Width 5.006 0.3524897 6.588 0.6358796 2.437136e-48
#### Edit ####
It looks like newer versions of tabular do more checks which makes the cbind
approach not work any more (and this could be a good thing, since I am not sure that it was properly matching the values if the ordering was different). I did not find a simple way to still do this using cbind
(though you could convert to a matrix, pad the rows for the headers, then cbind
).
Here is another approach that works, it is still a bit of a kludge since it hardcodes the species variable in the function (and the function would therefore have to be updated specifically for each table it is used in):
library(tables)
iris2 <- iris[ iris$Species != 'versicolor', ]
iris2$Species <- factor(iris2$Species)
P.value <- function(x) t.test(x ~ iris2$Species)$p.value
tmp <- tabular( Petal.Width+Petal.Length + Sepal.Width+Sepal.Length ~ Species* (mean+sd) + P.value, data=iris2 )
tmp

- 48,497
- 6
- 83
- 110
-
Interestingly, I just tried your code, and it does not seem to be working (using tables_0.7.64). Also tried it on my data, same error: "Error in cbind(deparse.level, ...) : Cannot cbind if tables have different rows." Any further advice how to add columns to an object of the class tabular? – aae Jul 16 '14 at 19:45
-
@WilliamBligh, I am seeing the same thing as you. Apparently the tables package has be updated to do more checks which breaks my original solution (but protects as well). I have added a different approach above. Thanks for the catch. – Greg Snow Jul 17 '14 at 14:24
-
Perfect. I nearly figured it out yesterday, but for some stupid reason I put my p.val and W.stats functions for a Wilcoxon test inside the brackets with (mean+sd), wondering why the values weren't the same as in the apply-call. Stupid me, but I think others can learn from this error. – aae Jul 17 '14 at 17:16
-
Hm, this went well for simple tables, but I can't figure out how to do it with more complicated data. I opened a new thread on this [here](http://stackoverflow.com/q/24889784/1866440). – aae Jul 22 '14 at 14:15
In a data object like that offered by Alexander:
aggregate( . ~ Group, FUN=function(x) c(mn=mean(x), sd=sd(x)), data=Data[-1])
# Output
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.05336901 0.85468837 0.06833691 0.94459083
2 2 -0.01658412 0.97583110 -0.02940477 1.11880398
V3.mn V3.sd V4.mn V4.sd
1 -0.2096497 1.1732246 0.08850199 0.98906102
2 0.0674267 0.8848818 -0.11485148 0.90554914
The data argument omits the ID column because you only want the results on the data columns. The request for a collection of p-values can be accomplished with:
sapply(names(Data)[-(1:2)], function(x) c(
Mean.Grp1 = mean(Data[Data$Group==1,x]),
Mean.Grp2 = mean(Data[Data$Group==2,x]),
`p-value`= t.test(Data[Data$Group==1, x],
Data[Data$Group==2,x])$p.value )
)
#---------------------------
V1 V2 V3 V4
Mean.Grp1 0.05336901 0.06833691 -0.2096497 0.08850199
Mean.Grp2 -0.01658412 -0.02940477 0.0674267 -0.11485148
p-value 0.70380932 0.63799544 0.1857743 0.28624585
If you wanted to add the SD's to that output the strategy seems obvious. You should note the back-quoting of the "p-value" name. Minus signs are syntactically "active" and would get interpreted as functions if not enclosed in quotes.

- 258,963
- 21
- 364
- 487
-
-
-
How would you scale this solution if a group had more levels, e.g. 12? Also, in what way does `summarize` from the dplyr package fail to meet this question requirements? – jiggunjer Jan 09 '19 at 09:10
-
@jiggunjer: Sorry, don't understand. There are two answers. The first one should not have any difficulty with multiple levels in the `Group` variable. The second one obviously could not use a t-test, since that is for two groups. Are you asking how to do some sort of chi-square or F test? I have no idea why you are asking a second question about the `dplyr` package, since I don't think the `dplyr` package existed in 2012. – IRTFM Jan 10 '19 at 00:27
-
Sorry I wasn't aware of the timeline, disregard the dyplr comment. But I was referring to the second example, i.e. multiple t-tests, so group 1 vs 3, 1 vs 4, 2 vs 3, etc. I think such a multiple comparision is a pretty common thing to do, so I was wondering how this would scale for that. – jiggunjer Jan 10 '19 at 04:17
-
Yes it’s common but it’s also wrong, at least as commonly practiced. Read up on the problem of multiple comparisons. – IRTFM Jan 10 '19 at 06:33
First let's make some example data. For each sample, we have a unique ID, its experimental group, and some variables for which we want to calculate the mean and SD.
## Make a data frame called "Data" with five columns
Data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100), rnorm(100), rnorm(100)))
names(Data) <- c("ID", "Group", "V1", "V2", "V3", "V4")
## Now we will take a peak at the top of our data frame
> head(Data)
ID Group V1 V2 V3 V4
1 1 2 0.3681539 -0.5008400 1.2060665 -0.7352376
2 2 1 -0.1043180 2.2038190 -1.4367898 2.1961246
3 3 2 -0.2720279 -0.5923554 -1.4628190 -1.8776453
4 4 1 -2.3299662 -0.1216227 0.4200776 1.5504020
5 5 2 -0.3670578 -1.5903221 -0.6287083 -1.0543262
6 6 1 0.4840047 -0.3181554 -1.4596980 -0.4261827
Now we can run a for loop through the variables, pull their means and SDs and p values, and dump them all in an object called "Results".
## Define object which will receive our results
Results <- NULL
Results <- as.data.frame(Results)
## Open for loop
for (i in 3:6) {
## Run the t.test() and save it in a temporary object
temp <- t.test(Data[which(Data$Group == 1), i], Data[which(Data$Group == 2), i])
## Put the name of our variable in our results object
Results[i-2,1] <- names(Data)[i]
## Group 1 mean and SD
Results[i-2,2] <- temp$estimate[1]
Results[i-2,3] <- sd(Data[which(Data$Group == 1), i])
## Group 2 mean and SD
Results[i-2,4] <- temp$estimate[2]
Results[i-2,5] <- sd(Data[which(Data$Group == 2), i])
## P value for difference
Results[i-2,6] <- temp$p.value
rm(temp)
}
Now we can make our results pretty and print them.
## Add column names
names(Results) <- c("Variable", "Group.1.Mean", "Group.1.SD", "Group.2.Mean", "Group.2.SD", "P.Value")
## View our results
> Results
Variable Group.1.Mean Group.1.SD Group.2.Mean Group.2.SD P.Value
1 V1 0.21544390 0.9404104 -0.01426226 1.0570324 0.2537820
2 V2 0.26287585 1.0048291 0.22992285 0.9709686 0.8679038
3 V3 -0.06112963 0.9855287 0.17423440 1.0198694 0.2434507
4 V4 0.33848678 0.9360016 0.07905932 0.9106595 0.1632705

- 977
- 1
- 13
- 29