
I am always unsure how to retrieve a summary with dplyr.

Let us suppose I have a dataset of individuals and households.

dta = rbind(c(1, 1, 45), 
  c(1, 2, 47), 
  c(2, 1, 24),
  c(2, 2, 26), 
  c(3, 1, 67), 
  c(4, 1, 20),
  c(4, 2, 21),
  c(4, 3, 7)
 ) 
dta = as.data.frame(dta)
colnames(dta) = c('householdid', 'id', 'age')

 householdid id age
           1  1  45
           1  2  47
           2  1  24
           2  2  26
           3  1  67
           4  1  20
           4  2  21
           4  3   7

Imagine I want to calculate the number of people in each household and the mean age by household, and then reuse this information in the original dataset.

library(dplyr)

dta %>% 
  group_by(householdid) %>% 
  summarise( nhouse = n(), meanAgeHouse = mean(age) ) %>% 
  merge(., dta, all = TRUE)

I often use merge, but it is sometimes slow when the dataset is huge.
Is it possible to use `mutate` instead of `merge`?
giac
  • Yes, just do `dta %>% group_by(householdid) %>% mutate( nhouse = n(), meanAgeHouse = mean(age) )` – David Arenburg Jun 08 '15 at 14:59
  • I would also suggest looking into the _data.table_ package. These things are pretty straightforward and very fast in data.table. It has the concept of recycling values, which will be helpful here. – nehiljain Jun 08 '15 at 15:02
  • The solution provided by @DavidArenburg is excellent. If you want to keep the results, just use that code with an assignment: `data <- code by David`. That seems reasonable: dplyr is smart and fast enough not to reallocate memory, but just to point to the old object with the new elements added. – SabDeM Jun 08 '15 at 15:51
  • @DavidArenburg thank you very much! Please put it as an answer. – giac Jun 08 '15 at 16:17
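
For the data.table route mentioned in the comments, a minimal sketch could look like the following. It assumes the data.table package is installed; the object name `dt` is only illustrative, and the data are retyped from the table printed in the question.

```r
library(data.table)

# Example data retyped from the question (household 4 has three members).
dt <- data.table(
  householdid = c(1, 1, 2, 2, 3, 4, 4, 4),
  id          = c(1, 2, 1, 2, 1, 1, 2, 3),
  age         = c(45, 47, 24, 26, 67, 20, 21, 7)
)

# := adds the columns by reference; .N is the group size, and both
# per-group values are recycled to every row of the household.
dt[, `:=`(nhouse = .N, meanAgeHouse = mean(age)), by = householdid]

print(dt)
```

Because `:=` modifies `dt` in place, no assignment or merge step is needed afterwards.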

1 Answer

dta %>% group_by(householdid) %>% mutate( nhouse = n(), meanAgeHouse = mean(age) )
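
To see that the grouped `mutate` gives the same result as summarising and joining back, here is a small self-contained sketch. The data are retyped from the table printed in the question; the names `with_mutate` and `with_join` are illustrative, and `inner_join` is used here in place of `merge` purely for the comparison.

```r
library(dplyr)

# Example data retyped from the question (household 4 has three members).
dta <- data.frame(
  householdid = c(1, 1, 2, 2, 3, 4, 4, 4),
  id          = c(1, 2, 1, 2, 1, 1, 2, 3),
  age         = c(45, 47, 24, 26, 67, 20, 21, 7)
)

# Grouped mutate: each summary is computed per household and
# recycled to every row of that household, so no join is needed.
with_mutate <- dta %>%
  group_by(householdid) %>%
  mutate(nhouse = n(), meanAgeHouse = mean(age)) %>%
  ungroup()

# The summarise-then-join route, kept only for comparison.
with_join <- dta %>%
  group_by(householdid) %>%
  summarise(nhouse = n(), meanAgeHouse = mean(age)) %>%
  inner_join(dta, by = "householdid")

# Compare the two results after putting the columns in the same order.
same <- all.equal(
  as.data.frame(with_mutate),
  as.data.frame(with_join[, names(with_mutate)])
)
print(same)
```

Calling `ungroup()` at the end is optional, but it avoids carrying the grouping into later steps by accident.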
PKumar
3pitt