I have seen a couple of posts on how to write one's own functions with dplyr functions. For example, this post shows how to use group_by (regroup) and summarise inside a function. I thought it would be interesting to see whether I could write a function that uses the major dplyr functions. My hope is that this helps us better understand how to write functions with dplyr.

DATA

country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)

GOAL

I wanted to write the following process in a function.

foo %>%
    mutate(new = ifelse(value > 60, 1, 0)) %>%
    filter(id %in% c("a", "b", "d")) %>%
    group_by(country) %>%
    summarize(whatever = sum(value))

TRY

### Here is a function which does the same process

myFun <- function(x, ana, bob, cathy) x %>%
    mutate(new = ifelse(ana > 60, 1, 0)) %>%
    filter(bob %in% c("a", "b", "d")) %>%
    regroup(as.list(cathy)) %>%
    summarize(whatever = sum(ana))

myFun(foo, value, id, "country")

Source: local data frame [2 x 2]

  country whatever
1  France 233.1384
2      UK 245.5400

You may notice that arrange() is not there. This is the one I am struggling with. Here are two experiments. The first was successful: the order of the countries changed from UK-France to France-UK. The second was not.

### Experiment 1: This works for arrange()

myFun <- function(x, ana) x %>%
         arrange(ana)

myFun(foo, country)

   country id    value
1   France  a 90.12723
2   France  b 86.64229
3   France  c 74.93320
4   France  d 80.69495
5   France  e 72.60077
6       UK  a 84.28033
7       UK  b 67.01209
8       UK  c 94.24756
9       UK  d 79.49848
10      UK  e 63.51265


### Experiment 2: This was not successful.

myFun <- function(x, ana, bob) x %>%
         filter(ana %in% c("a", "b", "d")) %>%
         arrange(bob)

myFun(foo, id, country)

Error: incorrect size (10), expecting :6

### This works, by the way.
foo %>%
filter(id %in% c("a", "b", "d")) %>%
arrange(country)

Given that the first experiment was successful, I have a hard time understanding why the second one failed. There may be something extra one has to do in the second experiment. Does anybody have an idea? Thank you for taking the time.

jazzurro

2 Answers

I installed dplyr 0.3 and lazyeval once issue 352 was closed, to see how dplyr functions might be used inside another function. After reading the vignette on non-standard evaluation, it looks like interp from lazyeval combined with the new functions ending in _ is one option. Note that group_by_ now replaces regroup.

set.seed(16)
foo = data.frame(country = rep(c("UK", "France"), each = 5), 
               id = rep(letters[1:5], times = 2), 
               value = runif(10, 50, 100), stringsAsFactors = FALSE)

First the code/results outside the function:

library(lazyeval)
library(dplyr)

foo %>%
    mutate(new = ifelse(value > 60, 1, 0)) %>%
    filter(id %in% c("a", "b", "d")) %>%
    group_by(country) %>%
    summarize(whatever = sum(value))

Source: local data frame [2 x 2]

  country whatever
1  France 213.0009
2      UK 207.8331

Then turn the above process into a function:

myFun = function(x, ana, bob, cathy) {
    x %>%
        mutate_(new = interp(~ifelse(var > 60 , 1, 0), var = as.name(ana))) %>%
        filter_(interp(~var %in% c("a", "b", "d"), var = as.name(bob))) %>%
        group_by_(cathy) %>%
        summarize_(whatever = interp(~sum(var), var = as.name(ana)))
}

Which gives the desired results.

myFun(foo, "value", "id", "country")
Source: local data frame [2 x 2]

  country whatever
1  France 213.0009
2      UK 207.8331

For your second problem with arrange, I tried

myfun2 = function(x, ana, bob) x %>%
    filter_(interp(~var %in% c("a", "b", "d"), var = as.name(ana))) %>%
    arrange_(as.name(bob))

myfun2(foo, "id", "country")
aosmith
  • Thank you very much for this update. It seems that the way one writes functions using `dplyr` is now simpler than what I was thinking. This is great stuff. – jazzurro Oct 01 '14 at 16:10
  • Hey man, I have been playing with your code today and I am struggling with `arrange`. `arrange_(as.name(ana), as.name(bob))` works fine. But I wanna add `desc` for bob. `arrange_(as.name(ana), ~desc(as.name(bob)))` gives no error, but it does not work. `arrange_(interp(as.name(ana), ~desc(as.name(bob))))` is the same. Do you have any ideas? – jazzurro Oct 03 '14 at 15:22
  • I got it now: `arrange_(as.name(ana), interp(~desc(var), var = as.name(bob)))`. I am still confused, but since `desc` is another function, you gotta do `interp()` in `arrange_`. – jazzurro Oct 03 '14 at 15:42
  • @jazzurro Looks good - that's what I would have tried. I don't have a very good grasp on all of this yet - I keep thinking I'm overcomplicating things. As things mature, I'm sure this will all become clearer. – aosmith Oct 03 '14 at 15:47
Actually, none of your experiments really work; they all have scoping problems. They only look like they work because you defined the vectors country, id, and value in the global environment and did not remove them. So when you call your functions, they use the vectors from the global environment rather than the columns of the data frame.

To show this, let's remove those vectors before calling your functions.

First, create the vectors and the data.frame:

library(dplyr)
country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)

Defining your first function:

myFun <- function(x, ana, bob, cathy) x %>%
  mutate(new = ifelse(ana > 60, 1, 0)) %>%
  filter(bob %in% c("a", "b", "d")) %>%
  regroup(as.list(cathy)) %>%
  summarize(whatever = sum(ana))

Calling without removing the vectors (it will look like it works, but it is actually using the vectors from the global env):

myFun(foo, value, id, "country")
Source: local data frame [2 x 2]

  country whatever
1  France 208.1008
2      UK 192.4287

Now remove the vectors and call your function again (this time it fails, because it can't find the vectors):

rm(country, id, value)
myFun(foo, value, id, "country")

Error in mutate_impl(.data, named_dots(...), environment()) :
object 'value' not found

That also explains why your second arrange experiment failed while the others appeared to work. In the second experiment, arrange was given the vector country from the global environment, which has 10 elements. But after the filter the data has only 6 rows, so arrange was expecting 6 elements.
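The size mismatch can be reproduced without dplyr. What follows is a minimal base R sketch of the idea, not of dplyr's internals: the helper name check_sizes is hypothetical, and the data only mirrors the shape of foo.

```r
# Data shaped like 'foo' in the question (values are irrelevant here).
foo <- data.frame(country = rep(c("UK", "France"), each = 5),
                  id = rep(letters[1:5], times = 2),
                  stringsAsFactors = FALSE)

# A leftover vector in the calling environment, as in the question.
country <- foo$country

# Hypothetical helper: 'bob' is an ordinary argument, so the bare name
# passed by the caller is looked up in the caller's environment, not in 'x'.
check_sizes <- function(x, bob) {
  kept <- x[x$id %in% c("a", "b", "d"), ]
  c(rows_after_filter = nrow(kept), length_of_bob = length(bob))
}

sizes <- check_sizes(foo, country)
print(sizes)
```

The two numbers differ (6 rows after the filter, 10 elements in the vector), which is the same mismatch the error message reports.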

There are different strategies to make your functions work. For example, take a look at this answer by G. Grothendieck for some insight into how to do it. Or just wait a little: as Hadley pointed out, support for programming with dplyr is coming soon.
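For readers on a newer dplyr, here is a hedged sketch of how the function might look. It assumes dplyr 1.0 or later, where the embrace operator {{ }} and the .groups argument are available. It is not the approach from the linked answer and was not available when this was written. The name my_fun is hypothetical.

```r
library(dplyr)

set.seed(16)
foo <- data.frame(country = rep(c("UK", "France"), each = 5),
                  id = rep(letters[1:5], times = 2),
                  value = runif(10, 50, 100),
                  stringsAsFactors = FALSE)

# Assumption: dplyr >= 1.0. Each {{ arg }} forwards the bare column name
# supplied by the caller, so it is resolved inside the data frame.
my_fun <- function(x, ana, bob, cathy) {
  x %>%
    mutate(new = ifelse({{ ana }} > 60, 1, 0)) %>%
    filter({{ bob }} %in% c("a", "b", "d")) %>%
    group_by({{ cathy }}) %>%
    summarize(whatever = sum({{ ana }}), .groups = "drop") %>%
    arrange({{ cathy }})
}

res <- my_fun(foo, value, id, country)
print(res)
```

Because the column names are resolved against the data, this version does not depend on vectors named country, id, or value existing in the global environment.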

Carlos Cinelli