8

Given this data.frame:

set.seed(4)
df <- data.frame(x = rep(1:5, each = 2), y = sample(50:100, 10, T))
#    x  y
# 1  1 78
# 2  1 53
# 3  2 93
# 4  2 96
# 5  3 61
# 6  3 82
# 7  4 53
# 8  4 76
# 9  5 91
# 10 5 99

I would like to write some simple functions (i.e. feature engineering) to create features for x and then join each of the resulting data.frames together. For example:

library(dplyr)
count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y   <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y  <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))  
# and many more...

This can be accomplished with plyr::join_all but I am wondering if there is better (or more performant) method with dplyr or data.table?

df_with_features <- plyr::join_all(list(count_x(df), sum_y(df), mean_y(df)),
                                   by = 'x', type = 'full')

# > df_with_features
#   x count_x sum_y mean_y
# 1 1       2   131   65.5
# 2 2       2   189   94.5
# 3 3       2   143   71.5
# 4 4       2   129   64.5
# 5 5       2   190   95.0
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • 3
    the dplyr format would be: `df %>% group_by(x) %>% summarise(count_x = n(), sum_y = sum(y), mean_y = mean(y))`. you get all the columns done in one step. – jeremycg Nov 24 '15 at 14:04
  • @jeremycg Definitely, but imagine more complicated functions where I can't pass everything into one statement. – JasonAizkalns Nov 24 '15 at 14:10
  • 2
    @JasonAizkalns can you demo where this wouldn't work? – Tyler Rinker Nov 24 '15 at 14:14
  • 1
    @TylerRinker @jeremycg *actually*, just see jeremy's [blog post](http://jeremy.kiwi.nz/2015/06/12/code-for-final-kaggle-model.html) for examples (ha! I knew I shamelessly stole this concept/idea from somewhere...that'll teach you to read Kaggle posts late at night...). I would have to modify my example a bit, but his blog post points to situations where one call to `summarise` would not work (or at least situations where things would be more difficult to maintain). – JasonAizkalns Nov 24 '15 at 14:18
  • @JasonAizkalns huh, someone read my blog! In cases like that, where you have used different groupings etc, I guess `join_all` does work. I think if I had to write it without `plyr`, I'd use `Reduce` and `merge` (or one of the dplyr merges). – jeremycg Nov 24 '15 at 14:23
  • 2
    [see here for an example of `Reduce` & `merge`](http://stackoverflow.com/a/33031393/2204410) – Jaap Nov 24 '15 at 14:42
  • This function should definitely be added to dplyr as well @hadley – Ömer An Aug 06 '20 at 06:02

2 Answers2

6

Combining @SimonOHanlon's data.table method with @Jaap's Reduce and merge techniques appears to yield the most performant results:

library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt   <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt  <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]

Reduce(function(...) merge(..., all = TRUE, by = c("x")), 
       list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))

Updating to include a tidyverse / purrr (purrr::reduce) approach:

library(tidyverse)
list(count_x(df), sum_y(df), mean_y(df)) %>% 
  reduce(left_join) 
camille
  • 16,432
  • 18
  • 38
  • 60
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
2

In data.table parlance this would be the equivalent of having a sorted keyed data.table and using the key to join the various data.tables.

e.g.

require(data.table)
setDT(df)  #df is now a data.table
df_count <- df[ , list(count_x=.N),by=x]
df_sum <- df[ , list(sum_y = sum(y)),by=x]
#  merge.data.table executes a fast join on the shared key
merge(df_count,df_sum)
#   x count_x sum_y
#1: 1       2   129
#2: 2       2   128
#3: 3       2   154
#4: 4       2   182
#5: 5       2   151

In your example you might write something like this:

count_x <- function(dt) dt[ , list(N = .N) , keyby=x ]
sum_y <- function(dt) dt[ , list(Sum=sum(y)),keyby=x]

#  Then merge...
merge(sum_y(df),count_x(df))
#   x Sum N
#1: 1 129 2
#2: 2 128 2
#3: 3 154 2
#4: 4 182 2
#5: 5 151 2
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184