40

This has really challenged my ability to debug R code.

I want to use ddply() to apply the same functions to several sequentially named columns, e.g. a, b, c. To do this I intend to pass each column name as a string and use eval(parse(text=ColName)) so that the function can reference it. I grabbed this technique from another answer.

And this works well, until I put ddply() inside another function. Here is the sample code:

# Required packages:
library(plyr)

myFunction <- function(x, y){
    NewColName = "a"
    z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
    )
    return(z)
}

a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")

#This works.
ColName = "a"
ddply(df, sv, summarize,
        Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)

#This doesn't work
#Produces error: "Error in parse(text = NewColName) : object 'NewColName' not found"
myFunction(df,sv)

#Output in both cases should be
#  b Ave
#1 0 1.5
#2 1 3.5

Any ideas? NewColName is even defined inside the function!

I thought the answer to this question, loops-to-create-new-variables-in-ddply, might help me but I've done enough head banging for today and it's time to raise my hand and ask for help.

Look Left

5 Answers

23

Today's solution to this question is to replace summarize with here(summarize), e.g.:

myFunction <- function(x, y){
    NewColName = "a"
    z = ddply(x, y, here(summarize),
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
    )
    return(z)
}

here(f), added to plyr in Dec 2012, captures the current context.
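
Since the original goal was to repeat the same summary over several columns, here is a minimal sketch of how here() might be used in a loop. The helper name group_means and the data frame dat are made up for illustration, and plyr::here is written with its namespace so that another package's here() cannot mask it.

```r
library(plyr)

# Hypothetical helper (not part of plyr): per-group mean of each named column.
group_means <- function(x, by, cols) {
  out <- lapply(cols, function(col) {
    # here() captures this frame, so `col` is visible when summarize
    # evaluates the expression for each piece of the data frame.
    ddply(x, by, plyr::here(summarize),
          Ave = mean(eval(parse(text = col)), na.rm = TRUE))
  })
  names(out) <- cols
  out
}

dat <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1), c = c(5, 6, 7, 8))
res <- group_means(dat, "b", c("a", "c"))
res$a   # group means of column a
res$c   # group means of column c
```

The result is a list with one summary data frame per column, which keeps each call identical to the single-column version above.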

Peter O
  • Brilliant! When using lubridate and plyr together, make sure you specifically refer to plyr::here() (as lubridate unfortunately redefines here()). – Pierre D Jan 02 '15 at 21:30
14

You can do this with a combination of do.call and call to construct the call in an environment where NewColName is still visible:

myFunction <- function(x, y){
    NewColName <- "a"
    # Build the call mean(a, na.rm=TRUE) here, where NewColName is still visible
    z <- do.call("ddply", list(x, y, summarize, Ave = call("mean", as.symbol(NewColName), na.rm=TRUE)))
    return(z)
}

myFunction(df,sv)
  b Ave
1 0 1.5
2 1 3.5
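
To make it clearer what call() and as.symbol() build, here is a small base R sketch that needs no plyr. The names col, expr and dat are only illustrative.

```r
col <- "a"

# Turn the string into a symbol and wrap it in an unevaluated call to mean().
expr <- call("mean", as.symbol(col), na.rm = TRUE)

expr_text <- deparse(expr)   # "mean(a, na.rm = TRUE)"

# The call contains no reference to `col` any more, so it can be evaluated
# against any data frame that has a column named a.
dat <- data.frame(a = c(1, 2, 3, 4))
expr_value <- eval(expr, dat)   # 2.5
```

Because the column name is baked into the expression before ddply ever sees it, nothing has to be looked up in the enclosing function later.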
James
9

I occasionally run into problems like this when combining ddply with summarize or transform. Not being smart enough to divine the ins and outs of navigating various environments, I tend to side-step the issue by not using summarize at all and instead using my own anonymous function:

myFunction <- function(x, y){
    NewColName <- "a"
    z <- ddply(x, y, .fun = function(xx,col){
                             c(Ave = mean(xx[,col],na.rm=TRUE))}, 
               NewColName)
    return(z)
}

myFunction(df,sv)

Obviously, there is a cost to doing this stuff 'manually', but it often avoids the headache of dealing with the evaluation issues that come from combining ddply and summarize. That's not to say, of course, that Hadley won't show up with a solution...
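
As a rough sketch of how the same idea might extend to several columns in one call: the helper name group_means_wide, the Ave_ prefix and the data frame dat are assumptions made for this example, not anything plyr provides.

```r
library(plyr)

# Hypothetical helper: one row per group, one mean column per entry in cols.
group_means_wide <- function(x, by, cols) {
  ddply(x, by, function(piece) {
    # `cols` is found through ordinary lexical scoping, so no eval() is needed.
    means <- sapply(piece[cols], mean, na.rm = TRUE)
    names(means) <- paste0("Ave_", cols)
    as.data.frame(as.list(means))
  })
}

dat <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1), c = c(5, 6, 7, 8))
res_wide <- group_means_wide(dat, "b", c("a", "c"))
res_wide
```

The anonymous function is an ordinary closure, which is why it sees cols without any of the evaluation tricks that summarize requires.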

joran
  • Until I fix the bug, this is my recommended workaround. Note that you can use `transform` etc inside your anonymous function. – hadley Aug 07 '11 at 19:09
  • @joran i implemented your solution and it worked for me. I was just inquisitive about why there is this scoping issue in ddply? is it because summarise creates a new dataframe and that doesn't have access to this colName? – joel.wilson Nov 03 '16 at 05:47
  • @user3801801 It has to do with the non-standard evaluation taking place of function arguments. I'd have to go sift through the source code to remind myself of the specific issue, but basically it has to do with how R knows where to evaluate arguments (i.e. in the context of the current enclosure, in the global environment, somewhere in between). – joran Nov 03 '16 at 15:04
  • @joran I got your point! ddply() uses with() inside I guess like data.tables and so a vector having column name stored in it doesn't have scope here I guess. – joel.wilson Nov 03 '16 at 15:30
5

The problem lies in the code of the plyr package itself. In the summarize function, there is a line eval(substitute(...),.data,parent.frame()). It is well known that parent.frame() can do pretty funky and unexpected stuff.

The solution of @James is a very nice workaround, but if I remember right @Hadley himself said before that the plyr package was not intended to be used within functions.

Sorry, I was wrong here. It is known though that for the moment, the plyr package gives problems in these situations.

Hence, here is a base R solution for the problem:

myFunction <- function(x, y){
    NewColName = "a"
    z = aggregate(x[NewColName],x[y],mean,na.rm=TRUE)
    return(z)
}
> myFunction(df,sv)
  b   a
1 0 1.5
2 1 3.5
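
To illustrate the parent.frame() point from the start of this answer, here is a base R toy model. It is a simplified imitation, not plyr's actual code: toy_summarize and toy_apply are invented stand-ins for summarize and ddply.

```r
# Toy stand-in for summarize: evaluates expr in .data, falling back to the
# frame of whoever called toy_summarize.
toy_summarize <- function(.data, expr) {
  eval(substitute(expr), .data, parent.frame())
}

# Toy stand-in for ddply: forwards ... to .fun, so the caller of .fun is
# this function, not the user's function.
toy_apply <- function(.data, .fun, ...) {
  .fun(.data, ...)
}

dat <- data.frame(a = c(1, 2, 3, 4))

# Called directly, parent.frame() is direct()'s frame, so NewColName is found.
direct <- function(d) {
  NewColName <- "a"
  toy_summarize(d, mean(eval(parse(text = NewColName))))
}

# Called through toy_apply, parent.frame() is toy_apply's frame, whose
# enclosure is the global environment, so NewColName is not found.
nested <- function(d) {
  NewColName <- "a"
  toy_apply(d, toy_summarize, mean(eval(parse(text = NewColName))))
}

direct_result <- direct(dat)   # 2.5
nested_result <- tryCatch(nested(dat), error = function(e) conditionMessage(e))
nested_result                  # an "object ... not found" message
```

In this model the lookup skips the user's function entirely, which would also explain why defining the variable globally makes the error go away.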
Joris Meys
  • +1 For taking my "avoid `summarize`" solution and providing an actual explanation of the problem. ;) – joran Aug 05 '11 at 16:56
  • +1 definitely for taking the time to explain the parent.frame() issue. It seems strange that a function can't be used inside another function because it forces you to write contiguous code. Maybe @Hadley could comment. – Look Left Aug 06 '11 at 04:40
  • I certainly never claimed that plyr was not intended to be used within functions - I've always said that this is a bug which I currently lack the understanding to fix :( – hadley Aug 07 '11 at 19:08
3

Looks like you have an environment problem. Global assignment fixes the problem, but at the cost of one's soul:

library(plyr)

a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
d.f = data.frame(a,b,c)
sv = c("b")

ColName = "a"
ddply(d.f, sv, summarize,
        Ave = mean(eval(parse(text=ColName)), na.rm=TRUE)
)

myFunction <- function(x, y){
    NewColName <<- "a"
    z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
    )
    return(z)
}

myFunction(x=d.f,y=sv)

eval is looking in parent.frame(1). So if you instead define NewColName outside myFunction it should work:

rm(NewColName)
NewColName <- "a"
myFunction <- function(x, y){

    z = ddply(x, y, summarize,
            Ave = mean(eval(parse(text=NewColName)), na.rm=TRUE)
    )
    return(z)
}
myFunction(x=d.f,y=sv)

By using get to pull out my.parse from the earlier environment, we can come much closer, but still have to pass curenv as a global:

myFunction <- function(x, y){
    NewColName <- "a"
    my.parse <- parse(text=NewColName)
    print(my.parse)
    curenv <<- environment()
    print(curenv)

    z = ddply(x, y, summarize,
            Ave = mean( eval( get("my.parse" , envir=curenv ) ), na.rm=TRUE)
    )
    return(z)
}

> myFunction(x=d.f,y=sv)
expression(a)
<environment: 0x0275a9b4>
  b Ave
1 0 1.5
2 1 3.5

I suspect that ddply is evaluating in the .GlobalEnv already, which is why all of the parent.frame() and sys.frame() strategies I tried failed.

Ari B. Friedman