0

I'm dealing with a somewhat big dataframe with 341k observations and 1.4k variables. The observations contain individual data for various countries over several years. I now want to create the year based means of individual data for each country:

set.seed(42)
ctry <- c(rep("AU", 6),rep("BY", 6),rep("CH", 6),rep("DE", 6))
year <- c(rep(2001,2),rep(2002,2),rep(2003,2),rep(2001,2),rep(2002,2),rep(2003,2),
      rep(2001,2),rep(2002,2),rep(2003,2),rep(2001,2),rep(2002,2),rep(2003,2))
a <- rnorm(24,1,.1)
b <- rnorm(24,2,.2)
c <- rnorm(24,3,.3)
(df <- data.frame(ctry,year,a,b,c))

I found this matching my problem and Peter Lustig gave a great answer. For myself I came to a solution with the ave() function but it's quite inconvenient for 1.4k variables:

df1 <- df
df1$mean.a <- ave(df[[3]], df[[2]], df[[1]])
df1$mean.b <- ave(df[[4]], df[[2]], df[[1]])
df1$mean.c <- ave(df[[5]], df[[2]], df[[1]])
(df2 <- unique(df1[,c(1,2,(ncol(df1)-2):ncol(df1))]))

akrun's function looks promising but I was not able to fit it for me. Maybe there's a more neat solution? Thanks.

Community
  • 1
  • 1
jay.sf
  • 60,139
  • 8
  • 53
  • 110

1 Answers1

0

Based on the OP's output with ave, it seems like we are grouping by the same columns and get the mean of other columns. So, instead of using ave we need to summarise by group. This can be done with dplyr

library(dplyr)
df %>% 
   group_by(ctry, year) %>% 
   summarise_at(.vars = names(df)[3:5], funs(Mean = mean))

Or if we need to get the mean of all variables except the grouping variables, as @StevenBeaupre suggested, we can use summarise_all

df %>% 
   group_by(ctry, year) %>% 
   summarise_all(mean)
#    ctry  year         a        b        c
#   <fctr> <dbl>     <dbl>    <dbl>    <dbl>
#1      AU  2001 1.0403130 2.146472 3.033630
#2      AU  2002 1.0497996 1.797957 2.930713
#3      AU  2003 1.0149072 1.982010 3.332794
#4      BY  2001 1.0708431 2.116029 3.054947
#5      BY  2002 1.0977855 2.042618 3.115368
#6      BY  2003 1.1795758 1.878795 2.593769
#7      CH  2001 0.9166175 1.836463 2.972699
#8      CH  2002 1.0251315 1.762191 3.297234
#9      CH  2003 0.8529646 1.984494 3.086288
#10     DE  2001 0.9439823 2.003146 3.206153
#11     DE  2002 0.8956026 1.906454 3.246241
#12     DE  2003 1.0521379 2.063271 2.830004

For convenience, we can also wrap this in a function making use of quosures from the devel version of dplyr

fMean <- function(dat, grp1, grp2, otherVars){
    #enquo does similar functionality as substitute from base R
    grp1 <- enquo(grp1) 
    grp2 <- enquo(grp2)

    dat %>%
       #evaluate the quosure by unquote (UQ or !!)
       group_by(!!grp1, !!grp2) %>%
       summarise_at(.vars = otherVars, funs(Mean = mean))


} 

fMean(df, ctry, year, names(df)[3:5])
#     ctry  year    a_Mean   b_Mean   c_Mean
#   <fctr> <dbl>     <dbl>    <dbl>    <dbl>
#1      AU  2001 1.0403130 2.146472 3.033630
#2      AU  2002 1.0497996 1.797957 2.930713
#3      AU  2003 1.0149072 1.982010 3.332794
#4      BY  2001 1.0708431 2.116029 3.054947
#5      BY  2002 1.0977855 2.042618 3.115368
#6      BY  2003 1.1795758 1.878795 2.593769
#7      CH  2001 0.9166175 1.836463 2.972699
#8      CH  2002 1.0251315 1.762191 3.297234
#9      CH  2003 0.8529646 1.984494 3.086288
#10     DE  2001 0.9439823 2.003146 3.206153
#11     DE  2002 0.8956026 1.906454 3.246241
#12     DE  2003 1.0521379 2.063271 2.830004

In addition to dplyr, we can also use data.table

library(data.table)
setDT(df)[, lapply(.SD, mean), .(ctry, year)]

Here, setDT, converts the 'data.frame' to data.table, grouped by 'ctry', 'year', we loop through the columns of Subset of Data.table and get the mean. If we need to get only the mean of specific column, specify it in .SDcols

setDT(df)[, lapply(.SD, mean), .(ctry, year), .SDcols = a:c]

Or base R with aggregate

aggregate(.~ctry + year, df, mean)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I get the `Error: .cols should be a character/numeric vector or a columns object` – jay.sf Apr 17 '17 at 11:07
  • @jaySf I am using the `devel` version of `dplyr` (soon to be released 0.6.0`. what is the version you have – akrun Apr 17 '17 at 11:10
  • 1
    I have `dplyr` 0.5.0. `summarise_at` didn't work, but `summarise_all` did! I'm looking forward for 0.6.0, looks great! – jay.sf Apr 17 '17 at 11:38
  • 1
    @jaySf You can download the devel version from github i.e. `https://github.com/tidyverse/dplyr` and use `devtools::install_github("tidyverse/dplyr")` I added the data.table and base R methods – akrun Apr 17 '17 at 11:40