3

I looked online extensively and did not see an answer to this particular question (I think).

The best way for me to explain myself will be with some code that replicates my problem. I made some temp data:

x <- runif(100,1,2)
y <- runif(100,2,3)

z <- c(rep(1,100))
temp <- cbind(x,y,z)

temp[1:25,3] = temp[1:25,3] +2

temp <- as.data.frame(temp)

And this is what temp looks like

         x        y   z
1   1.512620 2.552271 3
2   1.133614 2.455296 3
3   1.543242 2.490120 3
4   1.047618 2.069474 3
.      .        .     .
.      .        .     .
27  1.859012 2.687665 1
28  1.231450 2.196395 1

and it continue on until the end of the data frame (100 rows).

What I want to do is apply a function to the data frame BUT to subsets of the data. So, for example, I want to apply the function mean to the columns x and y for when z=3 and apply the function mean to the columns x and y for when z=1. So I would end up with 4 values: the mean of x when z=1 and when z=3 and the mean of y when z=1 and z=3. For my actual dataset the number of rows for when z= some value varies a lot.

I have been using the following code which does work; however, it makes me feel very uneasy since I feel like the code could be more efficient AND ideally avoid a for loop.

x <- c(unique(temp$z))

I use that ^^ to get the unique z values (in this case z=3 and z=1).

for(i in x){
  assign(paste("newdata",i,sep=""),subset(temp[which(temp$z==i),],select=c("x","y")))
} 

So I now have two new data frames newdata1 and newdata3 that don't have the same number of rows. newdata1 has all the values when z=1 and newdata3 has all the values when z=3.

library(gdata)

blah <-cbindX(newdata1,newdata3)

I use cbindX to combine the subsetted data into one large data frame again. I am not sure why I do this exactly (I made this code a long time ago). All I remember is this is the only way I could get it to work when I use the for loop above. The main problem with the code is when I have multiple z values then manually typing in that list becomes very cumbersome. If z ranged from 1 to 50 then a user would type in newdata1, newdata2, newdata3 .... etc.

But... it does work:

summ.test <- apply(blah,2,function(x) { 
c(min(x,na.rm=TRUE),median(x,na.rm=TRUE),max(x,na.rm=TRUE),sum(!is.na(x)))})

         x         y         x         y
[1,]  1.028332  2.018162  1.012379  2.009595
[2,]  1.509049  2.504000  1.427981  2.455296
[3,]  1.992704  2.998483  1.978359  2.970695
[4,] 75.000000 75.000000 25.000000 25.000000

So what I effectively did is create a new data frame with the values I subsetted from before and applied the functions of interest to them. So the first row is: mean of x when z=1, mean of y when z=1, mean of x when z=3, mean of y when z=3.

Main problems that should be fairly obvious: the for loop method to subset the data frame causes more problems then I'd hope. Any recommendations to avoid that entirely and still end up with the same result?

Please let me know if any of this is confusing or if I my code is just plain sloppy! Still working at formatting questions on this site also..

Frank
  • 66,179
  • 8
  • 96
  • 180
John Lombardi
  • 171
  • 3
  • 3
  • 10

3 Answers3

3
library(data.table)
DT <- as.data.table(temp)
DT[, lapply(.SD, mean), by=z]
   z        x        y
1: 3 1.515801 2.309161
2: 1 1.509637 2.532575

Or using base R:

with(temp, cbind(x=tapply(x, z, mean), y=tapply(y, z, mean)))
         x        y
1 1.509637 2.532575
3 1.515801 2.309161

PS, dont forget to set a seed, set.seed(1) for examples ;)

Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
2
> aggregate( . ~ z, data=temp, FUN=mean)
  z        x        y
1 1 1.505304 2.474642
2 3 1.533418 2.477191

When you will be applying the same function to multiple columns within categories of another column think about 'aggregate'. This is the version taht takes a formula argument where the "dot" before the tilde says to get the mean of all of the columns besides "z".

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Excellent. Thank you. The aggregate seems to work best for the data set especially since I know I will be doing more extensive functions later. This seems to work `aggregate( x ~ z, data=temp, FUN=function(x) c(mn =min(x,na.rm=TRUE), sum(!is.na(x))))` for multiple functions. I ideally want to do a regression for a bunch of different number of subsetted values, so I will see if I can tailor aggregate do that as well. Thank you. – John Lombardi Oct 09 '13 at 01:34
1

What I want to do is apply a function to the data frame BUT to subsets of the data.

So you can use the subset or aggregate function:

data = data.frame(x = runif(100), y = runif(100), z = sample(1:10, 100, replace = TRUE))

# get z > 3 AND z < 6 subset, for example
new.data = subset(data, z > 3 & z < 6) ## CAUTION: use &, not &&

# use newdata
cm = colMeans(new.data)
print(cm)

#        x         y         z 
#   0.4674450 0.5293915 4.5769231 

Hope it helps!

Fernando
  • 7,785
  • 6
  • 49
  • 81