I looked online extensively and did not see an answer to this particular question (I think).
The best way for me to explain myself will be with some code that replicates my problem. I made some temp data:
x <- runif(100,1,2)
y <- runif(100,2,3)
z <- c(rep(1,100))
temp <- cbind(x,y,z)
temp[1:25,3] = temp[1:25,3] +2
temp <- as.data.frame(temp)
And this is what temp looks like
x y z
1 1.512620 2.552271 3
2 1.133614 2.455296 3
3 1.543242 2.490120 3
4 1.047618 2.069474 3
. . . .
. . . .
27 1.859012 2.687665 1
28 1.231450 2.196395 1
and it continue on until the end of the data frame (100 rows).
What I want to do is apply a function to the data frame BUT to subsets of the data. So, for example, I want to apply the function mean to the columns x and y for when z=3 and apply the function mean to the columns x and y for when z=1. So I would end up with 4 values: the mean of x when z=1 and when z=3 and the mean of y when z=1 and z=3. For my actual dataset the number of rows for when z= some value varies a lot.
I have been using the following code which does work; however, it makes me feel very uneasy since I feel like the code could be more efficient AND ideally avoid a for loop.
x <- c(unique(temp$z))
I use that ^^ to get the unique z values (in this case z=3 and z=1).
for(i in x){
assign(paste("newdata",i,sep=""),subset(temp[which(temp$z==i),],select=c("x","y")))
}
So I now have two new data frames newdata1 and newdata3 that don't have the same number of rows. newdata1 has all the values when z=1 and newdata3 has all the values when z=3.
library(gdata)
blah <-cbindX(newdata1,newdata3)
I use cbindX to combine the subsetted data into one large data frame again. I am not sure why I do this exactly (I made this code a long time ago). All I remember is this is the only way I could get it to work when I use the for loop above. The main problem with the code is when I have multiple z values then manually typing in that list becomes very cumbersome. If z ranged from 1 to 50 then a user would type in newdata1, newdata2, newdata3 .... etc.
But... it does work:
summ.test <- apply(blah,2,function(x) {
c(min(x,na.rm=TRUE),median(x,na.rm=TRUE),max(x,na.rm=TRUE),sum(!is.na(x)))})
x y x y
[1,] 1.028332 2.018162 1.012379 2.009595
[2,] 1.509049 2.504000 1.427981 2.455296
[3,] 1.992704 2.998483 1.978359 2.970695
[4,] 75.000000 75.000000 25.000000 25.000000
So what I effectively did is create a new data frame with the values I subsetted from before and applied the functions of interest to them. So the first row is: mean of x when z=1, mean of y when z=1, mean of x when z=3, mean of y when z=3.
Main problems that should be fairly obvious: the for loop method to subset the data frame causes more problems then I'd hope. Any recommendations to avoid that entirely and still end up with the same result?
Please let me know if any of this is confusing or if I my code is just plain sloppy! Still working at formatting questions on this site also..