10

I am using ddply to aggregate and summarize data frame variables, and I am interested in looping through my data frame's list to create the new variables.

new.data <- ddply(old.data, 
                  c("factor", "factor2"),
                  function(df)
                    c(a11_a10 = CustomFunction(df$a11_a10),
                      a12_a11 = CustomFunction(df$a12_a11),
                      a13_a12 = CustomFunction(df$a13_a12),
                      ...
                      ...
                      ...))

Is there a way for me to insert a loop in ddply so that I can avoid writing each new summary variable out, e.g.

for (i in 11:n) {
  paste("a", i, "_a", i - 1) = CustomFunction(..... )
}

I know that this is not how it would actually be done, but I just wanted to show how I'd conceptualize it. Is there a way to do this in the function I call in ddply, or via a list?

UPDATE: Because I'm a new user, I can't post an answer to my own question:

My answer involves ideas from Nick's answer and Ista's comment:

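func <- function(old.data, min, max, gap) {
  varrange <- min:max
  # Build the column names, e.g. "a11_a10" when gap = 1
  usenames <- paste("a", varrange, "_a", varrange - gap, sep = "")
  # Apply CustomFunction to each named column within every factor combination
  # and return the summarized data frame
  ddply(old.data,
        .(factor, factor2),
        colwise(CustomFunction, usenames))
}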
Iris Tsui
  • you are probably after `transform()` or `summarize()`. the help page for `summarize` shows some good examples. – Chase May 03 '11 at 18:41
  • @Chase - Re: summarize, I could do for (i in 11:n) with(old.data, summarize(old.data[, paste("a", i, "_a", i - 1, sep="")], llist(factor, factor2), CustomFunction)) – Iris Tsui May 03 '11 at 18:57
  • 1
    If you had made your example reproducible it would have made life easier for your potential helpers. In the absence of a working example, I can only guess that you are looking for `?colwise` (see the examples for use with ddply). – Ista May 03 '11 at 19:39
  • @Ista - Thanks, colwise was exactly what I was looking for, after learning of Nick's initial loading of variable names into memory. – Iris Tsui May 03 '11 at 19:52
  • +1 @Casey. extremely elegant. – Ramnath May 03 '11 at 19:58

3 Answers

7

Building on the excellent answer by @Nick, here is one approach to the problem

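foo <- function(df){
  # Column names a11_a10, a12_a11, ...; n is assumed to be defined globally
  names <- paste("a", 11:n, "_a", 10:(n-1), sep = "")
  # Return one summary value per column as a named vector
  sapply(df[, names], CustomFunction)
}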

new.data = ldply(dlply(old.data, c("factor", "factor2")), foo)

Here is an example application using the tips dataset from ggplot2 (newer installations ship it in reshape2 instead). Suppose we want the average of tip and total_bill for each combination of sex and smoker; the code would look like this:

foo = function(df){names = c("tip", "total_bill"); sapply(df[,names], mean)}
new = ldply(dlply(tips, c("sex", "smoker")), foo)

It produces the output shown below

         .id      tip total_bill
1  Female.No 2.773519   18.10519
2 Female.Yes 2.931515   17.97788
3    Male.No 3.113402   19.79124
4   Male.Yes 3.051167   22.28450

Is this what you were looking for?

Ramnath
  • @Ramnath - This is exactly what I am looking for EXCEPT I'd like the factor/ID variables to stay separated. I believe the answer I gave in my update will allow me to do this, but you neatly answered my question and made the example into a function that I will try to adapt. Thanks. – Iris Tsui May 03 '11 at 20:09
  • @Casey. Your answer is more elegant!! I upvoted it and would have given it +2 if I could have. Nice work – Ramnath May 03 '11 at 20:13
  • And all of this can be done in one line using `colwise`. See my answer. – Andrie May 03 '11 at 21:31
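
On the point about keeping the factor/ID variables separated: one option, sketched here under assumptions, is to pass the same kind of helper to ddply directly instead of going through ldply(dlply(...)). The toy data frame and its values are hypothetical and only stand in for tips:

```r
library(plyr)

# Hypothetical toy data standing in for the tips dataset
toy <- data.frame(sex        = c("Female", "Female", "Male", "Male"),
                  smoker     = c("No", "Yes", "No", "No"),
                  tip        = c(2, 3, 4, 6),
                  total_bill = c(10, 20, 30, 50))

# Same idea as foo above: one summary value per selected column
foo <- function(df) {
  sapply(df[, c("tip", "total_bill")], mean)
}

# ddply keeps each split variable in its own column,
# rather than pasting them into a single .id column
res <- ddply(toy, c("sex", "smoker"), foo)
print(res)
```

The result has one row per sex/smoker combination, with sex and smoker as separate columns followed by the summary columns.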
4

If I understand you correctly, you essentially want to apply a custom function to every column in the ddply data.frame.

The good news is that plyr has a function that does exactly that, so the solution to your problem boils down to a one-liner:

Building on the excellent example of @Ramnath:

library(plyr)     # ddply() and numcolwise() live in plyr
library(ggplot2)  # source of the tips dataset used here
customfunction <- mean
ddply(tips, .(sex, smoker), numcolwise(customfunction))

     sex smoker total_bill      tip     size
1 Female     No   18.10519 2.773519 2.592593
2 Female    Yes   17.97788 2.931515 2.242424
3   Male     No   19.79124 3.113402 2.711340
4   Male    Yes   22.28450 3.051167 2.500000

This works because colwise converts a function that operates on a vector into one that operates column by column on a data.frame. There are two variants: numcolwise applies the function only to numeric columns, and catcolwise only to categorical columns. See ?colwise for more information.
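
To make that concrete, here is a minimal sketch on a hypothetical toy data frame (the names toy and col_means are made up for illustration). It shows the function returned by numcolwise being called on its own, and then per group through ddply:

```r
library(plyr)

# Hypothetical toy data: one grouping column and two numeric columns
toy <- data.frame(g = c("x", "x", "y"),
                  a = c(1, 3, 5),
                  b = c(2, 4, 6))

# numcolwise(mean) returns a new function that takes a data frame
# and applies mean to each numeric column
col_means <- numcolwise(mean)

overall  <- col_means(toy)               # one row: means of a and b
by_group <- ddply(toy, .(g), col_means)  # one row per level of g

print(overall)
print(by_group)
```

The non-numeric column g is skipped in the first call, and in the second call ddply uses it only to split the data.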

EDIT:

I appreciate that you may not wish to apply the function to all columns in your data.frame. Still, I find this syntax so easy that my general approach would be to modify the data.frame that I pass to ddply. For example, the following modified example subsets tips to exclude the tip column. The solution is still a one-liner:

ddply(tips[, -2], .(sex, smoker), numcolwise(customfunction))

     sex smoker total_bill     size
1 Female     No   18.10519 2.592593
2 Female    Yes   17.97788 2.242424
3   Male     No   19.79124 2.711340
4   Male    Yes   22.28450 2.500000
Andrie
  • this works only if you are applying the function to all the columns other than those used for the split. If that is indeed the case for the OP then this would be the best solution. Else, I think the solution proposed by the OP is more general. – Ramnath May 03 '11 at 21:42
  • @Ramnath, Agreed, and good point. Still, in my workflow I would prefer to do a simple subset of the data.frame rather than a somewhat complicated bit of coding. I have edited my answer to reflect this. – Andrie May 03 '11 at 21:52
  • 2
    colwise has a `cols` argument which accepts a character vector of variable names... – hadley May 04 '11 at 01:16
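
A minimal sketch of the suggestion in the last comment, assuming column names that follow the a11_a10 pattern from the question. In plyr the argument is named `.cols`; the toy data frame and its values are hypothetical:

```r
library(plyr)

# Hypothetical toy data: two "a" columns plus one column to be left out
toy <- data.frame(g       = c("x", "x", "y", "y"),
                  a11_a10 = c(1, 3, 5, 7),
                  a12_a11 = c(2, 4, 6, 8),
                  other   = c(10, 20, 30, 40))

# Build the names of the columns to summarize
usenames <- paste("a", 11:12, "_a", 10:11, sep = "")

# .cols restricts colwise to the named columns, so `other` is ignored
res <- ddply(toy, .(g), colwise(mean, .cols = usenames))
print(res)
```

This avoids subsetting the data frame beforehand, since the selection happens inside colwise.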
3

In steps:

varrange <- 11:n
usenames <- paste("a", varrange, "_a", varrange - 1, sep = "")
results <- sapply(usenames, function(curname) CustomFunction(df[, curname]))
names(results) <- usenames

Is this what you want?

Nick Sabbe
  • thanks for your response, but it is not what I am looking for. I do wish to end up with a data frame that includes unique observations per each combination of "factor" and "factor2", and the output from my CustomFunction for each of my "a" variables for each unique combo of my two factors. – Iris Tsui May 03 '11 at 19:24
  • basically I am looking for the ddply functionality, but automating the variable creation using a loop or list approach. – Iris Tsui May 03 '11 at 19:27