0

I have the following data:

test <- data.frame(Var1 = factor(c("A", "B", "A", "A", "B")),
                   Var2 = factor(c("I", "II", "II", "I", "III")),
                   x = c(1,1,2,3,4)
                   )

I would like to get the sum of x for every level of each variable separately. Aggregate does not do what I need:

>aggregate(test$x, by=list("Var1"= test$Var1,"Var2" = test$Var2), sum)
  Var1 Var2 x
1    A    I 4
2    A   II 2
3    B   II 1
4    B  III 4

I would like it to be the following:

Var1-A 6
Var1-B 5
Var2-I 4
Var2-II 3
Var2-III 4

In other words, it is a loop over each variable, and then each variable's levels.

How can I do this without loops? I figure a reshape is in order, but I'm not sure how to do it.

The_Anomaly
  • 2,385
  • 3
  • 18
  • 22
  • Yeah, if you reshape it, then it's straightforward. With data.table, that looks like `melt(setDT(test), id="x")[, sum(x), by=.(variable, value)]` – Frank Mar 07 '17 at 00:04
  • @Frank that is just what I need, thank you. Would you make it an answer and I can accept it? – The_Anomaly Mar 07 '17 at 00:11
  • 1
    I don't know; I think it's sort of a dupe, though you're asking two frequent questions in one (how to reshape, how to sum by group). I'll leave it open for now. Glad it works for you. For reference, these are the two FAQs: http://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format and http://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group – Frank Mar 07 '17 at 00:12

2 Answers2

1

This works (with a warning message because factors with different levels are being gathered into a single column):

test %>% 
  gather(level1, level2, -x) %>% 
  group_by(level1, level2) %>% 
  summarise(sum(x))

Result:

    level1 level2 `sum(x)`
     <chr>  <chr>    <dbl>
  1   Var1      A        6
  2   Var1      B        5
  3   Var2      I        4
  4   Var2     II        3
  5   Var2    III        4

There are fixes for the factor levels issue in this answer.

Community
  • 1
  • 1
neilfws
  • 32,751
  • 5
  • 50
  • 63
1

The OP wants to compute the aggregate for each variable separately but wants to combine the separate results in one data frame. In SQL the latter step would correspond to a UNION operation.

With data.table, this can be achieved by rbindlist():

library(data.table)
setDT(test)
cols <- c("Var1", "Var2")
rbindlist(
  lapply(cols, function(.col) test[, sum(x), by = .col]), idcol = TRUE
)[, setnames(.SD, c("var", "lvl", "sum_x"))][, var := factor(var, labels = cols)][]
#    var lvl sum_x
#1: Var1   A     6
#2: Var1   B     5
#3: Var2   I     4
#4: Var2  II     3
#5: Var2 III     4

Explanation

  1. Store the variable names to loop over in cols.
  2. Loop over the variables using lapply() and for each variable compute the aggregates grouped by the levels within the variable. The result is a list of data.tables.
  3. The list is combined row-wise using rbindlist. The id column gives the number of the list element the rows are taken from.
  4. Rename the columns.
  5. Convert the number in the id column to the names of the variables.

Note, we could have named the list elements with the variable names by setNames() before calling rbindlist() but this would come with a perfomance penalty due to copying the result. Instead, this is done in the last step where only factor labels have to be assigned to the numbers in the id column.


Of course, there is an alternative approach (similar to this answer) which first reshapes the data from wide to long format and then does the aggregates grouped by variable and level.

melt(setDT(test), "x", value.name = "level")[, .(sum_x = sum(x)), by = .(variable, level)]
#   variable level sum_x
#1:     Var1     A     6
#2:     Var1     B     5
#3:     Var2     I     4
#4:     Var2    II     3
#5:     Var2   III     4

data.tablereports no issues with factor levels.


It would be interesting to see which of the two approaches is more efficient in terms of memory consumption and speed for large problems.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134