R applying a function to a subset of a data frame

Question

I looked online extensively and did not see an answer to this particular question (I think).

The best way for me to explain myself will be with some code that replicates my problem. I made some temp data:

x <- runif(100,1,2)
y <- runif(100,2,3)

z <- c(rep(1,100))
temp <- cbind(x,y,z)

temp[1:25,3] = temp[1:25,3] +2

temp <- as.data.frame(temp)

And this is what temp looks like

         x        y   z
1   1.512620 2.552271 3
2   1.133614 2.455296 3
3   1.543242 2.490120 3
4   1.047618 2.069474 3
.      .        .     .
.      .        .     .
27  1.859012 2.687665 1
28  1.231450 2.196395 1

and it continue on until the end of the data frame (100 rows).

What I want to do is apply a function to the data frame BUT to subsets of the data. So, for example, I want to apply the function mean to the columns x and y for when z=3 and apply the function mean to the columns x and y for when z=1. So I would end up with 4 values: the mean of x when z=1 and when z=3 and the mean of y when z=1 and z=3. For my actual dataset the number of rows for when z= some value varies a lot.

I have been using the following code which does work; however, it makes me feel very uneasy since I feel like the code could be more efficient AND ideally avoid a for loop.

x <- c(unique(temp$z))

I use that ^^ to get the unique z values (in this case z=3 and z=1).

for(i in x){
  assign(paste("newdata",i,sep=""),subset(temp[which(temp$z==i),],select=c("x","y")))
}

So I now have two new data frames newdata1 and newdata3 that don't have the same number of rows. newdata1 has all the values when z=1 and newdata3 has all the values when z=3.

library(gdata)

blah <-cbindX(newdata1,newdata3)

I use cbindX to combine the subsetted data into one large data frame again. I am not sure why I do this exactly (I made this code a long time ago). All I remember is this is the only way I could get it to work when I use the for loop above. The main problem with the code is when I have multiple z values then manually typing in that list becomes very cumbersome. If z ranged from 1 to 50 then a user would type in newdata1, newdata2, newdata3 .... etc.

But... it does work:

summ.test <- apply(blah,2,function(x) { 
c(min(x,na.rm=TRUE),median(x,na.rm=TRUE),max(x,na.rm=TRUE),sum(!is.na(x)))})

         x         y         x         y
[1,]  1.028332  2.018162  1.012379  2.009595
[2,]  1.509049  2.504000  1.427981  2.455296
[3,]  1.992704  2.998483  1.978359  2.970695
[4,] 75.000000 75.000000 25.000000 25.000000

So what I effectively did is create a new data frame with the values I subsetted from before and applied the functions of interest to them. So the first row is: mean of x when z=1, mean of y when z=1, mean of x when z=3, mean of y when z=3.

Main problems that should be fairly obvious: the for loop method to subset the data frame causes more problems then I'd hope. Any recommendations to avoid that entirely and still end up with the same result?

Please let me know if any of this is confusing or if I my code is just plain sloppy! Still working at formatting questions on this site also..

Ricardo Saporta · Answer 1 · 2013-10-09T01:11:21.117

3

library(data.table)
DT <- as.data.table(temp)
DT[, lapply(.SD, mean), by=z]
   z        x        y
1: 3 1.515801 2.309161
2: 1 1.509637 2.532575

Or using base R:

with(temp, cbind(x=tapply(x, z, mean), y=tapply(y, z, mean)))
         x        y
1 1.509637 2.532575
3 1.515801 2.309161

PS, dont forget to set a seed, set.seed(1) for examples ;)

edited Oct 09 '13 at 01:11

answered Oct 09 '13 at 01:05

Ricardo Saporta

54,400
17
144
178

This also works well. Thanks for the answer and the tip! – John Lombardi Oct 09 '13 at 01:42

IRTFM · Accepted Answer · 2013-10-09T01:27:48.647

2

> aggregate( . ~ z, data=temp, FUN=mean)
  z        x        y
1 1 1.505304 2.474642
2 3 1.533418 2.477191

When you will be applying the same function to multiple columns within categories of another column think about 'aggregate'. This is the version taht takes a formula argument where the "dot" before the tilde says to get the mean of all of the columns besides "z".

edited Oct 09 '13 at 01:27

answered Oct 09 '13 at 01:18

IRTFM

258,963
21
364
487

Excellent. Thank you. The aggregate seems to work best for the data set especially since I know I will be doing more extensive functions later. This seems to work `aggregate( x ~ z, data=temp, FUN=function(x) c(mn =min(x,na.rm=TRUE), sum(!is.na(x))))` for multiple functions. I ideally want to do a regression for a bunch of different number of subsetted values, so I will see if I can tailor aggregate do that as well. Thank you. – John Lombardi Oct 09 '13 at 01:34

Fernando · Answer 3 · 2013-10-09T01:46:34.953

What I want to do is apply a function to the data frame BUT to subsets of the data.

So you can use the subset or aggregate function:

data = data.frame(x = runif(100), y = runif(100), z = sample(1:10, 100, replace = TRUE))

# get z > 3 AND z < 6 subset, for example
new.data = subset(data, z > 3 & z < 6) ## CAUTION: use &, not &&

# use newdata
cm = colMeans(new.data)
print(cm)

#        x         y         z 
#   0.4674450 0.5293915 4.5769231

Hope it helps!

R applying a function to a subset of a data frame

3 Answers3

Linked