Aggregating two columns in R dataframe

Question

I have a dataframe in R called food:

foodID   calories   fat    protein

 123       0.5      0.4     0.9
 432       0.65     0.3     0.7
 123       0.32     0.6     0.5
 983       0.82     0.2     0.6

and I'm trying to average up the calories and protein column by foodID.

I tried:

cal_pro <- aggregate(food[2,4], list(food$foodID), function(df) mean(df))

But it appears that i can't select the columns to be applied the mean function by food[2,4]? Could anyone help me out on this.

Find more alternatives [here](https://stackoverflow.com/a/16657546/1315767) — Jilber Urbina, Apr 12 '19 at 14:33
Possible duplicate of [Apply function conditionally](https://stackoverflow.com/questions/16657512/apply-function-conditionally) — shea, Apr 12 '19 at 15:37

score 1 · Answer 1 · answered Apr 12 '19 at 20:04

Using dplyr, you can just group_by and summarize:

food %>%
    group_by(foodID) %>%
    summarize(calories_average = mean(calories),
              protein_average = mean(protein))

# A tibble: 3 x 3
  foodID calories_average protein_average
   <int>            <dbl>           <dbl>
1    123             0.41             0.7
2    432             0.65             0.7
3    983             0.82             0.6

Rather than specifying each variable, you can use summarize_at to select multiple variables to summarize at once. We pass in 2 arguments: the variables to summarize, and a list of functions to apply to them. If the list is named, as it is here, then the name is added to the summary column as a suffix (giving "calores_average" and "protein_average":

food %>%
    group_by(foodID) %>%
    summarize_at(c('calories', 'protein'), list(average = mean))

summarize_at also allows you to use various helper functions to select variables by prefix, suffix, or regex (as shown below). You can learn more about them here: ?tidyselect::select_helpers

food %>%
    group_by(foodID) %>%
    summarize_at(vars(matches('calories|protein')), list(average = mean))

akrun · Answer 2 · 2019-04-12T15:06:14.990

0

We can use the formula method

aggregate(cbind(calories, protein) ~ foodID, food, mean)

Or using the OP's code, it should be c(2, 4), because if we do 2, 4, it is selecting the 2nd row of 4th column by row/column indexing

aggregate(food[c(2, 4)], list(food$foodID), mean)

EDIT: Based on @RuiBarradas comments

edited Apr 12 '19 at 15:06

answered Apr 12 '19 at 14:30

akrun

874,273
37
540
662

1

I believe it should be `c(2, 4)` not `2:4`. – Rui Barradas Apr 12 '19 at 14:36
@akrun doing [2:4] applies the mean function to the third column as well isn't it? Does this mean i will have to remove the third column later on? – Maxxx Apr 12 '19 at 14:36
@Maxx Sorry i thought you want a sequence of columns – akrun Apr 12 '19 at 14:37
@Maxxx Do you expect anything else in the solution. – akrun Apr 12 '19 at 14:48

score 0 · Answer 3 · answered Apr 12 '19 at 15:28

You can use data.table package-

> setDT(dt)[,list(avg_calorie=mean(calories),avg_protein=mean(protein)),by=foodID]

Output-

    foodID avg_calorie avg_protein
1:    123        0.41         0.7
2:    432        0.65         0.7
3:    983        0.82         0.6

Aggregating two columns in R dataframe

3 Answers3