Summary stats by factor level for multiple variables

Question

I want to produce dataframes containing summary statistics for each factor level for multiple variables.

For example, if I have the following dataframe:

Factor <- c("A","A","A","B","B","B")
Variable1 <- c(3,4,5,4,5,3)
Variable2 <- c(7,9,14,16,10,10)
mydf <- data.frame(Factor, Variable1, Variable2)
mydf
  Factor Variable1 Variable2
1      A         3         7
2      A         4         9
3      A         5        14
4      B         4        16
5      B         5        10
6      B         3        10

and I have the following function that I want to use to produce my summary stats:

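# SEM() is not a base R function and is not defined in this post; it is assumed
# to return the standard error of the mean, i.e. sd(x)/sqrt(length(x)).
my.summary <- function(x, na.rm=TRUE){result <- c(n=as.integer(length(x)),
Mean=mean(x, na.rm=TRUE), SD=sd(x, na.rm=TRUE), SeM = SEM(x),
Median=median(x),   Min=min(x), Max=max(x))}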

To apply this to factor levels of Variable1 I can do this:

library(plyr)  # provides ddply()
ddply(mydf, c("Factor"), function(x) my.summary(x$Variable1))
  Factor n Mean SD       SeM Median Min Max
1      A 3    4  1 0.5773503      4   3   5
2      B 3    4  1 0.5773503      4   3   5

Now I can do the same for Variable2:

ddply(mydf, c("Factor"), function(x) my.summary(x$Variable2))

This is easy enough with just two variables, but with many variables it would be a pain. How can I produce a dataframe of the summary stats for each variable and factor level without having to adjust the code for every variable?

I have tried aggregate.data.frame, but it doesn't work with my.summary. It works with summary, but produces one big data frame.

Thanks

`melt` your data into long format so you have `variable` column and a `value` column, then use both `"Factor"` and `"variable"` as grouping variables. — Gregor Thomas, Nov 23 '15 at 16:46

score 3 · Answer 1 · answered Nov 23 '15 at 16:49

You could use summarise_each from dplyr:

library(dplyr)

mydf %>% group_by(Factor) %>%
         summarise_each(funs(my.summary(.)))

This works after modifying your function to return a list:

my.summary <- function(x, na.rm=TRUE){result <- list(c(n=as.integer(length(x)),
                                                  Mean=mean(x, na.rm=TRUE), SD=sd(x, na.rm=TRUE),
                                                  Median=median(x),   Min=min(x), Max=max(x)))}
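
summarise_each() and funs() have since been deprecated in dplyr. As a rough sketch of the same idea with the newer across() helper (assuming dplyr 1.0.0 or later), the statistics can be passed as a named list of functions. The list name `stats` and the result name `wide` are illustrative, and SeM is left out because SEM() is not defined in the question:

```r
library(dplyr)

mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Named list of summary functions; the names become column suffixes
stats <- list(n = length, Mean = mean, SD = sd,
              Median = median, Min = min, Max = max)

# One row per Factor level; columns are named <variable>_<statistic>,
# e.g. Variable1_Mean, Variable2_SD
wide <- mydf %>%
  group_by(Factor) %>%
  summarise(across(everything(), stats))
```

Inside a grouped summarise(), everything() skips the grouping column, so only Variable1 and Variable2 are summarised.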

score 2 · Sam Dickson · Accepted Answer · 2015-11-23T17:49:38.023

You could melt your data first:

library(plyr)      # provides ddply()
library(reshape2)  # provides melt()

df <- melt(mydf,id.vars = 1)
df1 <- ddply(df, c("Factor","variable"), function(x) my.summary(x$value))

If you want to split the data by the different variables you can use split():

df2 <- split(df1,df1$variable)

And if you want those split dataframes in the global environment, you can use list2env() which will make two new dataframes, Variable1 and Variable2 (or more if you have more variables):

list2env(df2,.GlobalEnv)
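
If plyr and reshape2 are not available, a similar list of per-variable dataframes can be sketched in base R alone. This is only an illustration: `my.summary` here is a simplified version without SeM (SEM() is not defined in the question), and `per_variable` is a hypothetical name.

```r
mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Simplified summary function returning a named numeric vector
my.summary <- function(x) {
  c(n = length(x), Mean = mean(x), SD = sd(x),
    Median = median(x), Min = min(x), Max = max(x))
}

# For every column except Factor: split the values by factor level,
# summarise each piece, and stack the results into one dataframe
per_variable <- lapply(mydf[-1], function(v) {
  stats <- do.call(rbind, lapply(split(v, mydf$Factor), my.summary))
  data.frame(Factor = rownames(stats), stats, row.names = NULL)
})

per_variable$Variable1  # summary of Variable1 by Factor
```

Keeping the results in a named list like this, rather than copying them into the global environment, makes it easier to loop over them later.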

edited Nov 23 '15 at 17:49

answered Nov 23 '15 at 16:50

thanks for that. It works nicely. But what if I want a separate dataframe for each variable? – Rory Shaw Nov 23 '15 at 17:21
I've updated to show how you can split the results into a list of dataframes, which can then be sent to the global environment. – Sam Dickson Nov 23 '15 at 17:50
Great, just been trawling google to find out how to do that to no avail! – Rory Shaw Nov 23 '15 at 17:56

akrun · Answer 3 · 2015-11-23T17:14:21.833

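We can use data.table:

library(data.table)

# Define the summary function first; it returns a list so that each
# statistic becomes its own column
my.summary <- function(x, na.rm=TRUE){list(n= length(x),
                                  Mean=mean(x),
                                  SD=sd(x),
                                  Median=median(x),
                                  Min=min(x),
                                  Max=max(x))}

# One row per Factor level, with columns named Variable1.n, Variable1.Mean, ...
setDT(mydf)[, unlist(lapply(.SD, my.summary),recursive=FALSE), Factor]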
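
The flattening step can be tried on its own in base R. The following small sketch uses illustrative names (`nested`, `flat`) and only two statistics to keep it short. It shows how unlist(..., recursive=FALSE) turns a list of per-variable lists into a single flat list whose names join the variable and the statistic with a dot:

```r
# One inner list of statistics per variable, as lapply(.SD, my.summary) would build
nested <- lapply(
  list(Variable1 = c(3, 4, 5), Variable2 = c(7, 9, 14)),
  function(x) list(n = length(x), Mean = mean(x))
)

# Remove one level of nesting; outer and inner names are joined with "."
flat <- unlist(nested, recursive = FALSE)

names(flat)
# "Variable1.n" "Variable1.Mean" "Variable2.n" "Variable2.Mean"
```

Because the result is a flat list of scalars, data.table can treat each element as one column of the grouped output.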
