8

EDIT: this is a dupe of How to implement coalesce efficiently in R, agreed. I didn't realize how my problem was more general than my specific application, so this discussion has been great.

Sometimes, the response variable in a randomized experiment is contained in a different column for each experimental group (Y_1 through Y_5 in the code below). It's often best to collect the response variable into a single column (Y_all). I end up doing it as in the example below. But I'm SURE there's a better way. thoughts?

set.seed(343)
N <- 1000
group <- sample(1:5, N, replace=TRUE)
Y_1 <- ifelse(group==1, rbinom(sum(group==1), 1, .5), NA)
Y_2 <- ifelse(group==2, rbinom(sum(group==2), 1, .5), NA)
Y_3 <- ifelse(group==3, rbinom(sum(group==3), 1, .5), NA)
Y_4 <- ifelse(group==4, rbinom(sum(group==4), 1, .5), NA)
Y_5 <- ifelse(group==5, rbinom(sum(group==5), 1, .5), NA)

## This is the part I want to make more efficient
Y_all <- ifelse(!is.na(Y_1), Y_1, 
                ifelse(!is.na(Y_2), Y_2, 
                       ifelse(!is.na(Y_3), Y_3, 
                              ifelse(!is.na(Y_4), Y_4, 
                                     ifelse(!is.na(Y_5), Y_5, 
                                            NA)))))

table(Y_all, Y_1, exclude = NULL)
table(Y_all, Y_2, exclude = NULL)
Community
  • 1
  • 1
Alex Coppock
  • 2,122
  • 3
  • 15
  • 31
  • 3
    For this particular application, finding the first non-missing value, the name of the SQL command is `coalesce`, and there's a very nice answer of [implementing SQL's coalesce in R](http://stackoverflow.com/q/19253820/903061). – Gregor Thomas May 06 '15 at 22:52
  • Please use `set.seed` – Frank May 06 '15 at 23:02
  • Gregor, this is exactly it: I'd never come across `Reduce` before. – Alex Coppock May 06 '15 at 23:09
  • 1
    I think I have a faster one for this particular case: `rowMeans( cbind(Y_1,Y_2,Y_3,Y_4,Y_5), na.rm=TRUE)` and `rowSums` would ahve worked as well if one can be assured that there is a max of 1 item per "row". – IRTFM May 06 '15 at 23:25

3 Answers3

5

I like to use a coalesce() function for this

#available from https://gist.github.com/MrFlick/10205794
coalesce<-function(...) {
    x<-lapply(list(...), function(z) {if (is.factor(z)) as.character(z) else z})
    m<-is.na(x[[1]])
    i<-2
    while(any(m) & i<=length(x)) {
        if ( length(x[[i]])==length(x[[1]])) {
            x[[1]][m]<-x[[i]][m]
        } else if (length(x[[i]])==1) {
            x[[1]][m]<-x[[i]]
        } else {
            stop(paste("length mismatch in argument",i," - found:", length( x[[i]] ),"expected:",length( x[[1]] ) ))
        }
        m<-is.na(x[[1]])
        i<-i+1
    }
    return(x[[1]])
}

Then you can do

Y_all <- coalesce(Y_1,Y_2,Y_3,Y_4,Y_5)

Of course, this is very specific to getting the first non-NA value.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • Thanks Mr. Flick -- do you have a preference for this implementation over those in http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r ? – Alex Coppock May 06 '15 at 23:11
  • No. You'd have to test to see what's best in your particular application. – MrFlick May 06 '15 at 23:13
  • This implementation is nice because it has a the factor conversion and error handling---the others are more bare-bones as yet. – Gregor Thomas May 06 '15 at 23:18
2

I think in this case you can use the melt function to convert the data to long format and then get rid of the missing values:

library(reshape2)

set.seed(10)
N <- 1000
group <- sample(1:5, N, replace=TRUE)
Y_1 <- ifelse(group==1, rbinom(sum(group==1), 1, .5), NA)
Y_2 <- ifelse(group==2, rbinom(sum(group==2), 1, .5), NA)
Y_3 <- ifelse(group==3, rbinom(sum(group==3), 1, .5), NA)
Y_4 <- ifelse(group==4, rbinom(sum(group==4), 1, .5), NA)
Y_5 <- ifelse(group==5, rbinom(sum(group==5), 1, .5), NA)

Y_all = data.frame(group, Y_1, Y_2,Y_3,Y_4,Y_5)

Y_all.m = melt(Y_all, id.var="group")
Y_all.m = Y_all.m[!is.na(Y_all.m$value),]
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • Should `identical(Y_all,Y_all.m$value)` be true? It seems the results have been re-sorted so that this is not the case. Hm, ditto for `identical(Y_all,Y_all.m$value[order(rownames(Y_all.m))])` – Frank May 06 '15 at 23:12
  • Y_all.m is a long version of Y_all, so they won't be identical. But you can confirm that they have the same values like this: `lapply(Y_all[,-1], table, exclude=NULL); tapply(Y_all.m$value, Y_all.m$variable, table, exclude=NULL)`. – eipi10 May 06 '15 at 23:17
  • Just occurred to me that you can also do: `identical(unname(unlist(Y_all[,-1])), Y_all.m$value)` (which yields `TRUE`). – eipi10 May 06 '15 at 23:28
  • Sorry -- I meant the OP's `Y_all` (not the data.frame that you give the same name here), which I took to be the desired output. I can see the counts are right in your result `with(Y_all.m,table(value,variable));table(Y_all,group)`, but I don't see something corresponding to the OP's `Y_all`; that's all. And I'm missing what the `melt` does for you. – Frank May 06 '15 at 23:30
1

Store the vectors in a matrix and then select:

Ymat    <- cbind(Y_1,Y_2,Y_3,Y_4,Y_5)
mycol   <- apply(!is.na(Ymat),1,which)

Y_all.f <- Ymat[cbind(1:nrow(Ymat),mycol)]

identical(Y_all,Y_all.f) # TRUE
Frank
  • 66,179
  • 8
  • 96
  • 180