I want to produce dataframes containing summary statistics for each factor level for multiple variables.

For example, say I have the following data frame:

Factor <- c("A","A","A","B","B","B")
Variable1 <- c(3,4,5,4,5,3)
Variable2 <- c(7,9,14,16,10,10)
mydf <- data.frame(Factor, Variable1, Variable2)
mydf
  Factor Variable1 Variable2
1      A         3         7
2      A         4         9
3      A         5        14
4      B         4        16
5      B         5        10
6      B         3        10

and I have the following function that I want to use to produce my summary stats:

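SEM <- function(x) sd(x, na.rm=TRUE)/sqrt(length(x))  # assumed definition; SEM() is not shown in the question
my.summary <- function(x, na.rm=TRUE){
  c(n=length(x), Mean=mean(x, na.rm=na.rm), SD=sd(x, na.rm=na.rm), SeM=SEM(x),
    Median=median(x, na.rm=na.rm), Min=min(x, na.rm=na.rm), Max=max(x, na.rm=na.rm))}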

To apply this to factor levels of Variable1 I can do this:

library(plyr)  # provides ddply()

ddply(mydf, c("Factor"), function(x) my.summary(x$Variable1))
  Factor n Mean SD       SeM Median Min Max
1      A 3    4  1 0.5773503      4   3   5
2      B 3    4  1 0.5773503      4   3   5

Now I can do the same for Variable2:

ddply(mydf, c("Factor"), function(x) my.summary(x$Variable2))

That is easy enough with just two variables, but it would be a pain with many. How can I produce a data frame of the summary stats for each variable and factor level without having to adjust the code for every variable?

I have tried aggregate.data.frame, but it doesn't work with my.summary. It works with summary, but produces one big data frame.

Thanks

Rory Shaw
    `melt` your data into long format so you have `variable` column and a `value` column, then use both `"Factor"` and `"variable"` as grouping variables. – Gregor Thomas Nov 23 '15 at 16:46

3 Answers

You could use summarise_each from dplyr:

library(dplyr)

mydf %>% group_by(Factor) %>%
         summarise_each(funs(my.summary(.)))

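This works after modifying your function to return a list, so that each group's summary is stored as one list element (note that this version drops SeM):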

my.summary <- function(x, na.rm=TRUE){
  # wrap the named vector in a list so it fits into a single cell per group
  list(c(n=length(x), Mean=mean(x, na.rm=na.rm), SD=sd(x, na.rm=na.rm),
         Median=median(x, na.rm=na.rm), Min=min(x, na.rm=na.rm), Max=max(x, na.rm=na.rm)))}
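
In current dplyr releases, summarise_each() and funs() are deprecated in favour of across(). As a rough sketch of the same idea, assuming dplyr 1.0.0 or later, the statistics can be passed as a named list of functions so that each one gets its own column. The `stats` list and the `res` name are illustrative and not part of the original answer, and SeM is left out because SEM() is not defined in the question.

```r
library(dplyr)

mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Illustrative named list of summary functions; the names become column suffixes.
stats <- list(n = length, Mean = mean, SD = sd,
              Median = median, Min = min, Max = max)

res <- mydf %>%
  group_by(Factor) %>%
  # apply every function in `stats` to every numeric column, per Factor level
  summarise(across(where(is.numeric), stats, .names = "{.col}_{.fn}"))

names(res)
```

This gives one row per factor level, with columns named like Variable1_Mean and Variable2_SD, rather than list-columns.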
jeremycg

You could melt your data first:

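library(reshape2)
library(plyr)  # provides ddply()

# long format: one row per Factor/variable/value combination
df <- melt(mydf, id.vars = "Factor")
df1 <- ddply(df, c("Factor","variable"), function(x) my.summary(x$value))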

If you want to split the data by the different variables you can use split():

df2 <- split(df1,df1$variable)

And if you want those split dataframes in the global environment, you can use list2env() which will make two new dataframes, Variable1 and Variable2 (or more if you have more variables):

list2env(df2,.GlobalEnv)
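
If you would rather not depend on reshape2 and plyr, or not write into the global environment, the same "one data frame per variable" result can be sketched in base R. This is only a sketch under a few assumptions: the grouping column is called Factor, every other column is numeric, and the shortened my.summary below stands in for the one in the question. The names `vars` and `summaries` are made up for illustration.

```r
mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Shortened stand-in for my.summary; returns a named numeric vector.
my.summary <- function(x, na.rm = TRUE) {
  c(n = length(x), Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm), Median = median(x, na.rm = na.rm))
}

# every column except the grouping column is summarised
vars <- setdiff(names(mydf), "Factor")

# named list with one summary data frame per variable
summaries <- lapply(setNames(vars, vars), function(v) {
  # split the values by factor level, summarise each piece, levels as rows
  stats <- t(sapply(split(mydf[[v]], mydf$Factor), my.summary))
  data.frame(Factor = rownames(stats), stats, row.names = NULL)
})

summaries$Variable2
```

Keeping the results in a named list means you can reach each one as summaries$Variable1, summaries$Variable2 and so on, and loop over them with lapply() if needed.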
Sam Dickson

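We can use data.table, after changing my.summary to return a list so that each statistic becomes its own column:

library(data.table)

my.summary <- function(x, na.rm=TRUE){list(n=length(x),
                                  Mean=mean(x, na.rm=na.rm),
                                  SD=sd(x, na.rm=na.rm),
                                  Median=median(x, na.rm=na.rm),
                                  Min=min(x, na.rm=na.rm),
                                  Max=max(x, na.rm=na.rm))}

# define my.summary first, then apply it to every non-grouping column by Factor
setDT(mydf)[, unlist(lapply(.SD, my.summary), recursive=FALSE), by=Factor]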
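
Two details of the call above may be worth spelling out: setDT() converts mydf to a data.table in place, and unlist(..., recursive=FALSE) flattens the nested list so that the column names come out as variable.statistic. A small sketch of a variant, assuming you want to leave mydf untouched and only summarise numeric columns; the names `dt`, `num_cols` and `res` are illustrative, and my.summary is shortened here.

```r
library(data.table)

mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Shortened stand-in for my.summary; returns a list, one element per statistic.
my.summary <- function(x, na.rm = TRUE) {
  list(n = length(x), Mean = mean(x, na.rm = na.rm), SD = sd(x, na.rm = na.rm))
}

dt <- as.data.table(mydf)                      # a copy; mydf stays a data.frame
num_cols <- names(dt)[sapply(dt, is.numeric)]  # columns to summarise

res <- dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
          by = Factor, .SDcols = num_cols]

names(res)
```

The resulting columns are Factor followed by Variable1.n, Variable1.Mean, Variable1.SD, and the same three for Variable2.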
akrun