1

I have got a huge dataframe with two columns that I need to combine. On may wonder if this isn't just the same problem mentionend in this post here, but my dataframe in fact doesn' exist of just two columns. There are many more (wit a lot of NA's) and I only want to deal with two of them, which maked it a bit more complicated I guess.

This is a piece of those columns:

    col1 col2
15   NA    4
16   NA    5
17   NA    5
18   NA    5
19   NA    1
20   NA   NA
21   NA   NA
22   1    NA
23   5    NA
24   2    NA
25   4    NA
26   3    NA
27   NA    2
28   NA    4
29   NA    5
30   NA    3

I need to combine the columns into one by replacing eachother's NA's. No preplacements take place when both columns contain an NA (but that's obvious).

The result should look like this:

    col1
15   4    
16   5  
17   5  
18   5  
19   1  
20   NA 
21   NA 
22   1  
23   5  
24   2  
25   4  
26   3  
27   2  
28   4  
29   5  
30   3  

I tried an ugly for loop:

for (i in 1:nrow(df)){
  if (is.na(df[i,1])==TRUE){
    df[i,1] <- df[i,2]
  }
 }
df <- df[,1]

but that code appeared to be way too slow. Does anyone have an idea about how two handle this problem?

Many thanks in advance!

Community
  • 1
  • 1
rdatasculptor
  • 8,112
  • 14
  • 56
  • 81
  • this might be simple if your gonna use sql to update the column 1 – Viscocent May 09 '14 at 13:25
  • possible duplicate of [Combine/merge columns while avoiding NA?](http://stackoverflow.com/questions/22106132/combine-merge-columns-while-avoiding-na) – Henrik May 09 '14 at 13:34
  • possible duplicate of [How to implement coalesce efficiently in R](http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r) – Brian Diggs May 09 '14 at 15:25

2 Answers2

3

No loop needed since you can assign to subsets:

df$col1[is.na(df$col1)] <- df$col2[is.na(df$col1)]
Roland
  • 127,288
  • 10
  • 191
  • 288
  • are you sure this should work? I got this message: "number of items to replace is not a multiple of replacement length" – rdatasculptor May 09 '14 at 13:30
  • Make sure that you have the same code inside the square brackets on each side, as in Roland's example. i.e. col1 twice, not col1 on the left and col2 on the right. – ping May 09 '14 at 13:33
1

Languages like SQL have a function called coalesce which returns the first non-missing value given a set of lists. I wrote a function that does this behavior in R.

coalesce<-function(...) {
    x<-lapply(list(...), function(z) {if (is.factor(z)) as.character(z) else z})
    m<-is.na(x[[1]])
    i<-2
    while(any(m) & i<=length(x)) {
        if ( length(x[[i]])==length(x[[1]])) {
            x[[1]][m]<-x[[i]][m]
        } else if (length(x[[i]])==1) {
            x[[1]][m]<-x[[i]]
        } else {
            stop(paste("length mismatch in argument",i," - found:", length( x[[i]] ),"expected:",length( x[[1]] ) ))
        }
        m<-is.na(x[[1]])
        i<-i+1
    }
    return(x[[1]])
}

And you would use it like

coalesce(col1,col2)

And you can also add a default value if all rows are NA

coalesce(col1,col2, -9)

It returns a new column rather than modifying any of the originals.

I try to keep the latest version of the function on this gist

MrFlick
  • 195,160
  • 17
  • 277
  • 295