14

I am trying to use pipe mutate statement using a custom function. I looked a this somewhat similar SO post but in vain. Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data) :

df <- 
  data.frame(exclude=c('B','B','D'), 
             B=c(1,0,0), 
             C=c(3,4,9), 
             D=c(1,1,0), 
             blob=c('fd', 'fs', 'sa'), 
             stringsAsFactors = F)

I have a function that uses the variable names so select some based on the value in the exclude column and e.g. calculates a sum on the variables not specified in exclude (which is always a single character).

FUN <- function(df){
  sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}

When I gives a single row (row 1) to FUN I get the the expected sum of C and D (those not mentioned by exclude), namely 4:

FUN(df[1,])

How do I do similarly in a pipe with mutate (adding the result to a variable s). These two tries do not work:

df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))

UPDATE This also do not work as intended:

df %>% rowwise(.) %>% mutate(s=FUN(.))

This works of cause but is not within dplyr's mutate (and pipes):

df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
user3375672
  • 3,728
  • 9
  • 41
  • 70
  • Could you give an example? This does not work: `df %>% rowwise(.) %>% mutate(s=FUN(.))`. The answer from @konvas appears somewhat clumpsy (see my comment to @konvas's suggestion) – user3375672 May 30 '17 at 18:00
  • 2
    Sorry, I misunderstood.You can do `df %>% rowwise(.) %>% mutate(s=FUN(data.frame(exclude = exclude, B = B, C = C, D = D)))`, or `df %>% rowwise() %>% nest(exclude:D) %>% mutate(s = map_dbl(data, FUN)) %>% unnest()`. The issue is that `dplyr` does not naturally operate on sub data.frames, but on columns (i.e. vectors). So some additional trickery will be necessary. – Axeman May 30 '17 at 20:58
  • This is more what I am after ! (For the record: it is `tidyr::nest()` and `purrr::map_dbl()` ). – user3375672 May 31 '17 at 05:11

2 Answers2

15

If you want to use dplyr you can do so using rowwise and your function FUN.

df %>% 
    rowwise %>% 
    do({
        result = as_data_frame(.)
        result$s = FUN(result)
        result
    })

The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate

df %>% 
    group_by(1:n()) %>% 
    do({
        result = as_data_frame(.)
        result$s = FUN(result)
        result
    })

The reason mutate doesn't work in this case, is that you are passing the whole tibble to it, so it's like calling FUN(df).

A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.

cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
konvas
  • 14,126
  • 2
  • 40
  • 46
  • Yes thanks, but my example above (sum) was just a toy example. My aim is to understand how to use a(ny) custom function with dplyr::mutate within the dplyr pipe (`%>%`). The suggestion within a do() statement seems very (as you say) inefficient and statefull - there much be a simpler way. I could also have done much more efficient (but then this is not with dplyr mutate and pipes): `df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))` . – user3375672 May 30 '17 at 17:57
12

purrr approach

We can use a combination of nest and map_dbl for this:

library(tidyverse)
df %>% 
  rowwise %>% 
  nest(-blob) %>% 
  mutate(s = map_dbl(data, FUN)) %>% 
  unnest

Let's break that down a little bit. First, rowwise allows us to apply each subsequent function to support arbitrary complex operations that need to be applied to each row.

Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.

Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).

unnest returns our tibble into the more standard structure.

purrrlyr approach

While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.

In your above example, we would use your data frame df and user-defined function FUN in the following way:

df %>% 
  by_row(..f = FUN, .to = "s", .collate = "cols")

That's it! Giving you:

# tibble [3 x 6]
  exclude     B     C     D  blob     s
    <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1       B     1     3     1    fd     4
2       B     0     4     1    fs     5
3       D     0     9     0    sa     9

Admittedly, the syntax is a little strange, but here's how it breaks down:

  • ..f = the function to apply to each row
  • .to = the name of the output column, in this case s
  • .collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"

See here for more information on using purrrlyr...


Performance

Forewarning, while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but also at a rather large speed loss. See the following microbenchmark test:

library(microbenchmark)
mbm <- microbenchmark(
  purrr.test = df %>% rowwise %>% nest(-blob) %>% 
    mutate(s = map_dbl(data, FUN)) %>% unnest,
  purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
  rowwise.test = df %>% 
    rowwise %>% 
    do({
      result = as_tibble(.)
      result$s = FUN(result)
      result
    }),
  group_by.test = df %>% 
    group_by(1:n()) %>% 
    do({
      result = as_tibble(.)
      result$s = FUN(result)
      result
    }),
  sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))}, 
  times = 1000
)
autoplot(mbm)

enter image description here

You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see @konvas answer), and rather on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.

Dave Gruenewald
  • 5,329
  • 1
  • 23
  • 35