Stack Overflow newbie here... I have read a lot of guidance on aggregate(), by() and tapply() but didn't find an answer.

Using the example from the R help page (warpbreaks is a data set that ships with R):

> aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
  wool tension   breaks
1    A       L 44.55556
2    B       L 28.22222
3    A       M 24.00000
4    B       M 28.77778
5    A       H 24.55556
6    B       H 18.77778

But how should I code it if I also need the results for all supersets (like rows 7 to 10 below)?

  wool tension   breaks
1    A       L 44.55556
2    B       L 28.22222
3    A       M 24.00000
4    B       M 28.77778
5    A       H 24.55556
6    B       H 18.77778
7    A       -           #mean of the set that wool=A, but no restriction to tension
8    B       - 
9    -       L           #mean of the set that tension=L, but no restriction to wool
10   -       -           #mean of the whole set in data frame

Methods that don't use the aggregate function are fine too. Thanks a lot!

Hi all, thanks for your answers! Actually I have 40+ subsets and 200+ variables to calculate (not just the one variable "breaks" in the example). So I find it inefficient to use tapply or aggregate(breaks ~ tension, data = warpbreaks, mean) and then merge the results. Please tell me if there are better ways to do the data manipulation in this case!

  • `aggregate(breaks ~ tension, data = warpbreaks, mean)` and the same for wool, is that what you're asking? – rawr Nov 05 '15 at 22:09
  • `aggregate(breaks ~ 0, data = warpbreaks, mean)` or simply `mean(warpbreaks$breaks)` – jogo Nov 05 '15 at 22:10
  • This was the first question I ever asked on SO! Perhaps the answers are dated by now though: http://stackoverflow.com/questions/16824544/apply-a-function-to-dataframe-subsetted-by-all-possible-combinations-of-categori – Rorschach Nov 05 '15 at 22:27
  • Another partial dupe: http://stackoverflow.com/q/31164350/1191259 – Frank Nov 06 '15 at 03:41

3 Answers


I am sure there is a more elegant way, but what about a simple tapply()? With a little data manipulation afterwards you can combine the results and get what you want.

> tapply(warpbreaks$breaks, warpbreaks$tension, mean)
       L        M        H 
36.38889 26.38889 21.66667 
> tapply(warpbreaks$breaks, warpbreaks$wool, mean)
       A        B 
31.03704 25.25926 
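
The "little data manipulation" mentioned above could look something like the following. This is only a sketch: the object names (`by_both`, `cells`, `combined`) are made up for illustration, and NA is assumed as the marker for "no restriction" on a grouping column.

```r
# Cell means and marginal means of breaks, all computed with tapply()
by_both    <- tapply(warpbreaks$breaks, warpbreaks[c("wool", "tension")], mean)
by_wool    <- tapply(warpbreaks$breaks, warpbreaks$wool, mean)
by_tension <- tapply(warpbreaks$breaks, warpbreaks$tension, mean)

# Turn the wool x tension matrix into long format (one row per cell)
cells <- as.data.frame(as.table(by_both), responseName = "breaks")

# Stack cell means, marginal means and the overall mean;
# NA in a grouping column means "no restriction on this variable"
combined <- rbind(
  cells,
  data.frame(wool = names(by_wool), tension = NA, breaks = as.vector(by_wool)),
  data.frame(wool = NA, tension = names(by_tension), breaks = as.vector(by_tension)),
  data.frame(wool = NA, tension = NA, breaks = mean(warpbreaks$breaks))
)
combined
```

This stays manageable for two grouping variables, but the number of tapply() calls grows quickly as more grouping variables are added.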
SabDeM

Here's a rather ugly answer

library(dplyr)

variables =  c("wool", "tension")

1:length(variables) %>%
  lapply(. %>% combn(variables, ., simplify = F)) %>%
  unlist(recursive = F) %>%
  c(list(character(0))) %>%
  data_frame(variables = .) %>%
  rowwise %>%
  do({group_by_(warpbreaks, .dots = variables) %>%
      summarize(breaks = mean(breaks))})
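
For readers on a recent dplyr, note that `data_frame()` and `group_by_()` have since been deprecated. The same idea (enumerate every subset of the grouping variables with combn(), aggregate once per subset, stack the pieces) can be sketched in base R, which also handles several value columns at once. `all_subset_means()` is a hypothetical helper written for this illustration, and NA is assumed as the marker for "no restriction".

```r
# Hypothetical helper: apply FUN to every value column for every subset
# of the grouping columns, including the empty subset (whole data frame).
all_subset_means <- function(data, groups, values, FUN = mean) {
  # every non-empty subset of the grouping columns, largest first
  subsets <- unlist(
    lapply(rev(seq_along(groups)), function(k) combn(groups, k, simplify = FALSE)),
    recursive = FALSE
  )
  parts <- lapply(subsets, function(g) {
    out <- aggregate(data[values], by = data[g], FUN = FUN)
    # grouping columns not used in this subset mean "no restriction"
    for (m in setdiff(groups, g)) out[[m]] <- NA
    out[c(groups, values)]
  })
  # the empty subset: a single row summarising the whole data frame
  total <- as.data.frame(lapply(data[values], FUN))
  for (m in groups) total[[m]] <- NA
  do.call(rbind, c(parts, list(total[c(groups, values)])))
}

result <- all_subset_means(warpbreaks,
                           groups = c("wool", "tension"),
                           values = "breaks")
result
```

With many value columns, `values` can simply be a longer character vector of column names; the number of aggregate() calls depends only on the number of grouping-variable subsets.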
bramtayl

Thanks to all of you, I learned a lot from this. The answers to the duplicate question, "dplyr summarize with subtotals", build the grid with expand.grid and fill it in using a function.

In my case, as I have more than one variable to summarise in my real data (2000+ variables rather than the single "breaks"), I find the ugly answer fastest.

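result1 <- aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
result2 <- aggregate(breaks ~ wool, data = warpbreaks, mean)
result3 <- aggregate(breaks ~ tension, data = warpbreaks, mean)
result4 <- data.frame(breaks = mean(warpbreaks$breaks))  # overall mean
# rbind() needs the same columns in every piece, so fill the unused grouping columns with NA
result2$tension <- NA
result3$wool <- NA
result4$wool <- NA
result4$tension <- NA
result <- rbind(result1, result2[names(result1)], result3[names(result1)], result4[names(result1)])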
Community