I have a big data.frame that I want to aggregate by groupings of a categorical variable, where the groupings are defined in a second data.frame. One method would be:

cars = mtcars
carb_grps = data.frame(carb = 1:8, carb_grp = rep(c('Low','Mid','High'), c(2,2,4)))
cars = merge(cars, carb_grps, by = 'carb')
aggregate(mpg ~ carb_grp, cars, mean)
  carb_grp      mpg
1     High 17.35000
2      Low 23.61176
3      Mid 15.90769

But this replicates all the carb_grp details in the large data.frame, which I'm guessing ties up more memory. Is there a more elegant/efficient way in R to achieve this?

geotheory

1 Answer

I think this is a great way of doing this. Here is the dplyr equivalent.

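library(dplyr)

# Lookup table mapping each carb value to its group
tibble(carb = 1:8, 
       carb_grp = rep(c('Low','Mid','High'), 
                      c(2,2,4) ) ) %>%
  # Name the join key explicitly rather than relying on the guessed default
  right_join(mtcars, by = 'carb') %>%
  group_by(carb_grp) %>%
  summarize(mpg = mean(mpg) )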
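
If the goal is to avoid adding a carb_grp column to the large table at all, one possible alternative is a named lookup vector used only at aggregation time. This is a minimal base R sketch, not part of the original answer; `carb_lookup`, `grp` and `res` are names made up for illustration. Note that it still builds a temporary grouping vector with one element per row, so it avoids modifying the data.frame rather than avoiding the allocation entirely.

```r
# Named lookup vector: names are the carb values, elements are the group labels
carb_lookup <- setNames(rep(c("Low", "Mid", "High"), c(2, 2, 4)), 1:8)

# Map each row's carb value to its group on the fly; mtcars itself is not modified
grp <- carb_lookup[as.character(mtcars$carb)]

# Mean mpg per group (tapply treats the character vector as a factor)
res <- tapply(mtcars$mpg, grp, mean)
print(res)
```

The group means should match those from the merge/aggregate approach in the question, since the same mapping is applied to the same rows.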
bramtayl
  • Thanks. A dplyr solution is preferable and I like the example. But I do note this is essentially the same as my method in that the `carb_grp` data is replicated for each data row. Do you know how R is treating this column under the hood? Would e.g. the `carb` and `carb_grp` columns need to be _factor_ class to optimise memory? – geotheory May 11 '16 at 17:36
  • @geotheory factors are not more efficient in terms of memory storage relative to character vectors as character vectors are stored in hash (some sort of magic). This has been true since around R 2.8. See this [post](http://stackoverflow.com/questions/36507061/what-is-a-good-rule-of-thumb-on-when-to-factorize-columns-in-r/36507363#36507363) and the links therein for details. – lmo May 11 '16 at 17:43
  • @lmo That's confusing because _factor_ does affect the value returned by `object.size` (see addendum to question). I do note the function provides an 'estimate'.. – geotheory May 11 '16 at 17:55
  • Read that hash documentation - I guess object.size must be wrong, although why it doesn't reflect R's new quantum mechanics is strange.. – geotheory May 11 '16 at 18:02
  • @geotheory Yeah, I don't know enough about the internals to give a reasonable answer, I'm just toeing the company line. – lmo May 11 '16 at 18:03
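
To check the character-versus-factor question from the comments on your own machine, a rough comparison along these lines may help. It is only a sketch: the vector length and the names `grp_chr` and `grp_fct` are arbitrary choices for illustration, and `object.size` reports an estimate. One thing worth keeping in mind when reading the output is that each element of a character vector is a pointer to a shared string, while a factor stores one integer code per element plus its levels, so the per-element cost can differ between the two depending on the platform.

```r
set.seed(1)
n <- 1e6  # arbitrary size, large enough for differences to be visible

# The same grouping labels stored as character and as factor
grp_chr <- sample(c("Low", "Mid", "High"), n, replace = TRUE)
grp_fct <- factor(grp_chr)

# Estimated sizes; compare the two numbers rather than trusting either in isolation
print(object.size(grp_chr), units = "MB")
print(object.size(grp_fct), units = "MB")
```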