R: Updating a data frame with another data frame

Question

Let's say our initial data frame looks like this:

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,NA,NA,NA),C=c(1,2,3,NA,NA,NA))

> df1
  Index A  B  C
1     1 1  1  1
2     2 2  2  2
3     3 3  3  3
4     4 4 NA NA
5     5 5 NA NA
6     6 6 NA NA

Another data frame contains new values for columns B and C:

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5))

> df2
  Index B C
1     4 4 5
2     5 4 5
3     6 4 5

How can you update the missing values in df1 so it looks like this:

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

My attempt:

library(dplyr)

> full_join(df1,df2)
Joining by: c("Index", "B", "C")
  Index  A  B  C
1     1  1  1  1
2     2  2  2  2
3     3  3  3  3
4     4  4 NA NA
5     5  5 NA NA
6     6  6 NA NA
7     4 NA  4  5
8     5 NA  4  5
9     6 NA  4  5

As you can see, this creates duplicate rows for indices 4, 5, and 6 instead of replacing the NA values.

Any help would be greatly appreciated!

thelatemail · Accepted Answer · 2016-05-19T03:07:05.440

6

merge then aggregate:

aggregate(. ~ Index, data=merge(df1, df2, all=TRUE), na.omit, na.action=na.pass )

#  Index B C A
#1     1 1 1 1
#2     2 2 2 2
#3     3 3 3 3
#4     4 4 5 4
#5     5 4 5 5
#6     6 4 5 6

Or in dplyr speak:

df1 %>% 
    full_join(df2) %>%
    group_by(Index) %>%
    summarise_each(funs(na.omit))

#Joining by: c("Index", "B", "C")
#Source: local data frame [6 x 4]
#
#  Index     A     B     C
#  (dbl) (int) (dbl) (dbl)
#1     1     1     1     1
#2     2     2     2     2
#3     3     3     3     3
#4     4     4     4     5
#5     5     5     4     5
#6     6     6     4     5
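
summarise_each() was deprecated in dplyr 0.7.0 and funs() has been deprecated since, so the pipeline above may warn or fail on current versions. A minimal sketch of the same idea for dplyr >= 1.0.0, using across(), is shown here. The helper first_non_na() is a hypothetical name introduced for illustration; it returns NA (rather than a zero-length vector) when a group has no non-NA value.

```r
library(dplyr)

df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA), C = c(1, 2, 3, NA, NA, NA))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

# Hypothetical helper: first non-NA value of x, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

res <- df1 %>%
  full_join(df2, by = c("Index", "B", "C")) %>%   # same join as above, stated explicitly
  group_by(Index) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

res
```

Because the helper always returns exactly one value per group, every Index present in either data frame should survive the summarise step.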

edited May 19 '16 at 03:07

answered May 19 '16 at 02:55

thelatemail


Thanks, it worked great! I really need to learn more about the aggregate function. Is there also a way to update entries that aren't necessarily NA? – Zyferion May 19 '16 at 06:38
I tried the aggregate option where the master file (MF) had 9634 obs of 430 variables and the update file (UF) had 9249 obs of 4 identically named variables. The result file has 9224 obs of 430 variables (so a considerable loss of master file data). The dplyr option was MUCH slower (and warned that summarise_each() is deprecated as of dplyr 0.7.0); in fact it never completed. So be warned: you will lose NA data from your master file with this solution. – Peter King Sep 05 '22 at 22:27
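
On the two follow-ups in the comments above (updating entries that are not NA, and keeping every row of the master data frame), a minimal base R sketch using match() might look like this. It assumes Index is unique in both data frames; update_cols, pos, overwritten and patched are illustrative names, not part of the answer.

```r
df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA), C = c(1, 2, 3, NA, NA, NA))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

# Columns present in both data frames, excluding the key
update_cols <- setdiff(intersect(names(df1), names(df2)), "Index")

pos    <- match(df2$Index, df1$Index)  # row of df1 matching each row of df2
keep   <- !is.na(pos)                  # ignore df2 rows with no match in df1
target <- pos[keep]

# Variant 1: overwrite matched cells whether or not they are NA
overwritten <- df1
for (col in update_cols) {
  overwritten[[col]][target] <- df2[[col]][keep]
}

# Variant 2: fill matched cells only where df1 currently holds NA
patched <- df1
for (col in update_cols) {
  new_vals <- df2[[col]][keep]
  fill <- is.na(patched[[col]][target])
  patched[[col]][target[fill]] <- new_vals[fill]
}
```

Neither variant joins or aggregates, so the number of rows in df1 is unchanged by construction.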

score 5 · Answer 2 · answered May 19 '16 at 02:18

We can use a join from data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), join it with 'df2' on "Index", and assign (:=) the values of 'i.B' and 'i.C' (the columns coming from 'df2') to 'B' and 'C'.

library(data.table)
setDT(df1)[df2, c('B', 'C') := .(i.B, i.C), on = "Index"]
df1
#   Index A B C
#1:     1 1 1 1
#2:     2 2 2 2
#3:     3 3 3 3
#4:     4 4 4 5
#5:     5 5 4 5
#6:     6 6 4 5
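
The join above overwrites B and C for every matched row. If only NA cells should be filled, one possible variant is to wrap the assignment in fcoalesce(), which is available in more recent data.table releases (an assumption about the installed version). The data below is a modified copy of the example, with one non-NA value at Index 4 to show that it is left alone; dt1 and dt2 are illustrative names.

```r
library(data.table)

dt1 <- data.table(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, 10, NA, NA), C = c(1, 2, 3, NA, NA, NA))
dt2 <- data.table(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

# Update join: keep the existing value where present, else take the one from dt2
dt1[dt2, on = "Index",
    `:=`(B = fcoalesce(B, i.B), C = fcoalesce(C, i.C))]

dt1
```

Note that := modifies dt1 by reference, just as setDT(df1)[...] modifies df1 in the answer.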

akrun


Hi akrun, I couldn't get this solution to work for me – Zyferion May 19 '16 at 02:37
@Zyferion My solution was based on the example you provided. It works for me with data.table 1.9.6. Also, as your comment was not very informative (i.e. not providing any error message or anything), I can't comment. – akrun May 19 '16 at 02:38
Error in `[.data.table`(setDT(df1), df2, `:=`(c("B", "C"), .(i.B, i.C)), : unused argument (on = "Index") – Zyferion May 19 '16 at 02:45
@Zyferion It could be because you are using an older version of `data.table`. Please update to new version – akrun May 19 '16 at 02:45

LMc · Answer 3 · 2022-06-27T22:02:19.467

4

As of dplyr >= 1.0.0 you can use rows_update:

library(dplyr)

df1 %>% 
  rows_update(df2, by = "Index")

  Index A B C
1     1 1 1 1
2     2 2 2 2
3     3 3 3 3
4     4 4 4 5
5     5 5 4 5
6     6 6 4 5

Alternatively, there is rows_patch:

rows_patch() works like rows_update() but only overwrites NA values.
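
To make the difference concrete, here is a small sketch on toy data (x and y are illustrative, not the data frames from the question). It assumes dplyr >= 1.0.0.

```r
library(dplyr)

x <- data.frame(Index = c(1, 2, 3), B = c(1, NA, 3))
y <- data.frame(Index = c(2, 3), B = c(20, 30))

# rows_update() overwrites every matched row
updated <- rows_update(x, y, by = "Index")

# rows_patch() only fills cells that are NA in x
patched <- rows_patch(x, y, by = "Index")

updated$B  # 1 20 30
patched$B  # 1 20  3
```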

edited Jun 27 '22 at 22:02

answered May 11 '22 at 16:54

LMc


This only works if both data frames have the same columns – ok1more Oct 06 '22 at 18:31
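
If the second data frame carries columns that the first one lacks, one possible workaround is to pass only the shared columns to rows_update(). This is a sketch under that assumption; df2_wide and shared are hypothetical names, and the extra column D is simply dropped rather than added to the result.

```r
library(dplyr)

df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA), C = c(1, 2, 3, NA, NA, NA))
df2_wide <- data.frame(Index = 4:6, B = c(4, 4, 4), C = c(5, 5, 5), D = c(6, 6, 6))

# Keep only the columns that also exist in df1 (this includes the key)
shared <- intersect(names(df2_wide), names(df1))

res <- rows_update(df1, df2_wide[shared], by = "Index")
res
```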

score 3 · Answer 4 · answered May 20 '16 at 08:04

For those interested, I've extended this problem to:

- handle updating a data frame with another data frame with new columns

- replace any existing entries, regardless of whether they're NA.

Here's the solution I found, using the aggregate function from @thelatemail's answer :)

library(dplyr)  # provides full_join() and last()

df1 = data.frame(Index=c(1:6),A=c(1:6),B=c(1,2,3,3,3,3),C=c(1,2,3,3,3,3))

df2 = data.frame(Index=c(4,5,6),B=c(4,4,4),C=c(5,5,5),D=c(6,6,6),E=c(7,7,7))

df3 = full_join(df1,df2)

# Return the last non-NA value of x
na.omit.last = function(x){
  x <- na.omit(x)
  last(x)
}

# For the columns not in df1 
dfA = aggregate(. ~ Index, df3, na.omit,na.action = na.pass)
dfA = dfA[,-(1:ncol(df1))] 
dfA = data.frame(lapply(dfA,as.numeric))

dfB = aggregate(. ~ Index, df3[,1:ncol(df1)], na.omit.last, na.action = na.pass)

# If there are more columns in df2, append dfA
if (ncol(df2) > ncol(df1)) {
  df3 = cbind(dfB,dfA)
} else {
  df3 = dfB
}

print(df3)
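
For comparison, the same two goals (overwrite existing entries and bring in new columns) can also be sketched in base R with match() instead of aggregate(). This is an alternative under stated assumptions, not the method used above: Index is unique in both data frames, every Index in df2 exists in df1, and the new columns are numeric. The names pos, value_cols, new_cols and result are illustrative.

```r
df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, 3, 3, 3), C = c(1, 2, 3, 3, 3, 3))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5),
                  D = c(6, 6, 6), E = c(7, 7, 7))

pos <- match(df2$Index, df1$Index)  # row of df1 for each row of df2
stopifnot(!anyNA(pos))              # assumption: every df2 Index exists in df1

value_cols <- setdiff(names(df2), "Index")
new_cols   <- setdiff(value_cols, names(df1))

result <- df1
result[new_cols] <- NA_real_               # add the new columns, filled with NA
result[pos, value_cols] <- df2[value_cols] # overwrite matched rows, NA or not

print(result)
```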

score 2 · Answer 5 · answered May 19 '16 at 01:59

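Not sure what the general case or conditions would be, but this works for this instance without dplyr. Note that it was written against the original version of the question, before the Index column was added, and it relies on the NA cells of df1 and the values of df2 lining up in the same (column-major) order: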

df3 <- as.matrix(df1)
df3[which(is.na(df3))] <- as.matrix(df2)
df3 <- as.data.frame(df3)
df3

  A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 5
5 5 4 5
6 6 4 5
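
With the Index column now present in both data frames, the positional fill needs df2 restricted to its value columns, otherwise the number of replacement values no longer equals the number of NA cells. A minimal sketch of that adjustment, still relying on the NA cells and the df2 values being in the same column-major order:

```r
df1 <- data.frame(Index = 1:6, A = 1:6,
                  B = c(1, 2, 3, NA, NA, NA), C = c(1, 2, 3, NA, NA, NA))
df2 <- data.frame(Index = c(4, 5, 6), B = c(4, 4, 4), C = c(5, 5, 5))

df3 <- as.matrix(df1)
# Six NA cells (B and C, rows 4 to 6) receive the six values of df2's B and C
df3[which(is.na(df3))] <- as.matrix(df2[, c("B", "C")])
df3 <- as.data.frame(df3)
df3
```

This does not look at Index at all, so it only gives the right result when the rows of df2 are already in the same order as the NA rows of df1.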

TomNash


thanks this could be useful - i've updated the question and added an index column though – Zyferion May 19 '16 at 02:16
