5

Let's say our initial data frame looks like this:

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))

> df1
  Index A  B  C
1     1 1  1  1
2     2 2  2  2
3     3 3  3  3
4     4 4 NA NA
5     5 5 NA NA
6     6 6 NA NA

Another data frame contains new information for col B and C

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))

> df2
  Index B C
1     4 4 5
2     5 4 5
3     6 4 5

How can you update the missing values in df1 so it looks like this:

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

My attempt:

library(dplyr)

> full_join(df1,df2)
Joining by: c("Index", "B", "C")
  Index  A  B  C
1     1  1  1  1
2     2  2  2  2
3     3  3  3  3
4     4  4 NA NA
5     5  5 NA NA
6     6  6 NA NA
7     4 NA  4  5
8     5 NA  4  5
9     6 NA  4  5

Which as you can see has created duplicate rows for the 4,5,6 index instead of replacing the NA values.

Any help would be greatly appreciated!

Zyferion
  • 149
  • 2
  • 11

5 Answers5

6

merge then aggregate:

aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )

#  Index B C A
#1     1 1 1 1
#2     2 2 2 2
#3     3 3 3 3
#4     4 4 5 4
#5     5 4 5 5
#6     6 4 5 6

Or in dplyr speak:

df1 %>% 
    full_join(df2) %>%
    group_by(Index) %>%
    summarise_each(funs(na.omit))

#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
#  Index     A     B     C
#  (dbl) (int) (dbl) (dbl)
#1     1     1     1     1
#2     2     2     2     2
#3     3     3     3     3
#4     4     4     4     5
#5     5     5     4     5
#6     6     6     4     5
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Thanks, it worked great! I really need to learn more about the aggregate function. Is there also a way to update entries that aren't necessarily NA? – Zyferion May 19 '16 at 06:38
  • I tried the aggregate option where the master file (MF) had 9634 obs of 430 variables and the update file (UF) had 9249 obs of 4 identically named variables. The result file has 9224 obs of 430 variables (So considerable loss of masterfile data). The dplyr option (which was MUCH slower) [warning summarise_each() deprecated in dplyr 0.7.0] - and in fact never completed. So be warned you will lose NA data from your master file with this solution. – Peter King Sep 05 '22 at 22:27
5

We can use join from data.table. Convert the 'data.frame' to 'data.table' (setDT(df1), join on with 'df1' using "Index" and assign (:=), the values in 'B' and 'C' with 'i.B' and 'i.C'.

library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
#   Index A B C
#1:     1 1 1 1
#2:     2 2 2 2
#3:     3 3 3 3
#4:     4 4 4 5
#5:     5 5 4 5
#6:     6 6 4 5
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi akrun, I couldn't get this solution to work for me – Zyferion May 19 '16 at 02:37
  • @Zyferion My solution was based on the example you provided. It works for me with data.table 1.9.6. Also, as your comment was not very informative (i.e. not providing any error message or anything), I can't comment. – akrun May 19 '16 at 02:38
  • Error in `[.data.table`(setDT(df1), df2, `:=`(c("B", "C"), .(i.B, i.C)), : unused argument (on = "Index") – Zyferion May 19 '16 at 02:45
  • 4
    @Zyferion It could be because you are using an older version of `data.table`. Please update to new version – akrun May 19 '16 at 02:45
4

As of dplyr >= 1.0.0 you can use rows_update:

library(dplyr)

df1 %>% 
  rows_update(df2, by = "Index")

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

Alternatively, there is rows_patch:

rows_patch() works like rows_update() but only overwrites NA values.

LMc
  • 12,577
  • 3
  • 31
  • 43
3

For those interested, I've extended this problem to:

- handle updating a data frame with another data frame with new columns

- replace any existing entries regardless if they're NA or not.

Heres the solution I found using the aggregate function from @thelatemail :)

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))

df3 = full_join(df1,df2)

# Create a function na.omit.last 
na.omit.last = function(x){
  x <- na.omit(x)
  x <- last(x)
}

# For the columns not in df1 
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))] 
dfA = data.frame(lapply(dfA,as.numeric))

dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)

# If there are more columns in df2 append dfA
if (ncol(df2) > ncol(df1)) {
  df3 = cbind(dfB,dfA)
}  else {
    df3 = dfB
  }

print(df3)
Zyferion
  • 149
  • 2
  • 11
2

Not sure what the general case or conditions would be, but this works for this instance without dplyr

df3 <- as.matrix(df1)
df3[which(is.na(df3))] <- as.matrix(df2)
df3 <- as.data.frame(df3)
df3

  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5
TomNash
  • 3,147
  • 2
  • 21
  • 57