Use merge() to update a data frame with values from a second data frame

Question

I'm trying to figure out how to use merge() to update a data frame.

Take for example the data frame foo

foo <- data.frame(index=c('a', 'b', 'c', 'd'), value=c(100, 101, NA, NA))

Which has the following values

index value
1     a   100
2     b   101
3     c    NA
4     d    NA

And the data frame bar

bar <- data.frame(index=c('c', 'd'), value=c(200, 201))

Which has the following values:

 index value
1     c   200
2     d   201

When I run the following merge() function to update the values for c and d

merge(foo, bar, by='index', all=T)

It results in this output:

 index value.x value.y
1     a     100      NA
2     b     101      NA
3     c      NA     200
4     d      NA     201

I'd like the output of merge() to avoid the creation of, in this specific example, of value.x and value.y but only retain the original column of value Is there a simple way of doing this?

Did you ever get answer to this question? I am looking for a solution for this same problem. — dataanalyst, Jul 06 '16 at 16:41
I wonder too why merge does not have, say an `overwrite=TRUE` parameter which would kick in when `by` is provided. It is incovienent to delete columns manually every time you want to want to rerun a merge. — Valentas, Aug 15 '17 at 06:35
See also: [Replace missing values (NA) in one data set with values from another where columns match](https://stackoverflow.com/questions/32638845/replace-missing-values-na-in-one-data-set-with-values-from-another-where-colum) — Henrik, Feb 07 '19 at 12:33
I just had the same issue and I think the most direct answer to your question is the answer by @jangorecki below which you should accept — Jakob, May 04 '21 at 10:48

score 16 · Answer 1 · answered Feb 07 '19 at 07:13

16

The optimal solution using data.table

library(data.table)
setDT(foo)
setDT(bar)
foo[bar, on="index", value:=i.value]
foo
#   index value
#1:     a   100
#2:     b   101
#3:     c   200
#4:     d   201

first argument in [ data.table method is named i thus we can refer to column from table in i argument using i. prefix.

answered Feb 07 '19 at 07:13

jangorecki

16,384
4
79
160

2

How would you do this if there were multiple columns that needed to be updated, say value1, value2, etc.? – harisf Sep 02 '21 at 13:22
This data.table solution seems to be an ideal replacement for VLOOKUP. Merging or joining dataframes in R does not allow individual values to be updated, which is how this function is often used in excel. – Apr 13 '22 at 12:02
@harisf, I would do this: ```foo[bar, on="index", `:=` (value1=i.value1, value2=i.value2)]``` – Valentin_Ștefan Apr 06 '23 at 14:08

apeescape · Answer 2 · 2010-07-06T22:03:58.570

12

Doesn't merge() always bind columns together? Does replace() work?

foo$value <- replace(foo$value, foo$index %in% bar$index, bar$value)

or match() so the order matters

foo$value[match(bar$index, foo$index)] <- bar$value

edited Jul 06 '10 at 22:03

answered Jul 06 '10 at 21:27

apeescape

1,109
7
10

2

One wrinkle with using `replace()` is that if the ordering in `bar` is not the same as in `foo`, it won't work properly. For example, if you try running the above example after `bar <- bar[c(2,1),]`, the end result does not come out correct. – andrewj Jul 06 '10 at 21:46
Yes, `match()` does work for my example. In reality, it turns out that my actual use case is more complicated, where I would like to match across multiple columns and not just a simple vector. I don't think `match()` works when you would like to match across multiple columns of a dataframe. – andrewj Jul 07 '10 at 18:12
Thank you! the idea to use the match() is good... however, if bar is to have another element that is not contained in foo (we want to update and add the new stuff) bar <- data.frame(index=c('c', 'd','e'), value=c(200, 201,215)) Then when we try to use match, we get an error. Error in foo$value[match(bar$index, foo$index)] <- bar$value : NAs are not allowed in subscripted assignments Any ideas how to overcome that? – moldovean Jan 08 '13 at 13:16
1

What if you have multiple columns for indices? – wolfsatthedoor Aug 18 '16 at 20:10

score 4 · Answer 3 · edited May 23 '17 at 12:09

I would also like to present an sql-solution using library sqldf and the R integrated sqlite-database. I like the simplicity, accuratness and power of sql.
Accurateness: since I can exactly define which object=rows I want to change without considering the order of a data.frame (foo.id = bar.id).
Power: in WHERE after SET and WHERE (third row) I can define all conditions I want to consider to update.
Simplicity: the syntax is more readable than using index in vectors, matrix or dataframes.

library(sqldf)

# I changed index to id since index does not work. 
#   Obviously index is a key word in sqlite.

(foo <- data.frame(id=c('a', 'b', 'c', 'd'), value=c(100, 101, NA, NA)))
(bar <- data.frame(id=c('c', 'd'), value=c(200, 201)))

sqldf(c(paste("UPDATE foo"
             ," SET value = (SELECT bar.value FROM bar WHERE foo.id = bar.id)"
             ," WHERE value IS NULL"
             )
        , " SELECT * FROM main.foo"
    )
)

Which gives

  id value
1  a   100
2  b   101
3  c   200
4  d   201

Similar issues:
r equivalent of sql update?
R sqlite: update with two tables

The SQL statement can run over multiple lines so `paste` is not needed. — G. Grothendieck, Aug 17 '16 at 20:42

score 0 · Answer 4 · answered Jul 06 '10 at 21:34

merge() only merges in new data. For instance, if you had a data set of average income for a few cities, and a separate data set of the populations of those cities, you would use merge() to merge in one set of data into the other.

Like apeescape said, replace() is probably what you want.

score 0 · Answer 5 · edited Jan 16 '17 at 05:26

Another approach could be:

Remove the NAs from the first data fram
Use rbind to append the data instead of using merge:

These are the original two data frames:

foo <- data.frame(index=c('a', 'b', 'c', 'd'), value=c(100, 101, NA, NA))
bar <- data.frame(index=c('c', 'd'), value=c(200, 201))

(1) Use the negation of is.na to remove the NAs:

foo_new <- foo[!is.na(foo$value),]

(2) Bind the data frames and you'll get the answer you were looking for

new_df <- rbind(foo_new,bar)

            new_df
            index value
            1     a   100
            2     b   101
            3     c   200
            4     d   201

score 0 · Answer 6 · answered Sep 27 '17 at 10:12

I think the most simple way is to "mark" the value which need to be update prior to the merge.

bar$update <- TRUE
foo <- merge(foo, bar, by='index', all=T, suffixes=c("",".update"))
foo[!is.na(foo$update),]$value <- foo[!is.na(foo$update),]$value.update
foo$value.update <- NULL
foo$update <- NULL

It would be faster using 'data.table'

library(data.table)
foo <- as.data.table(foo)
bar <- as.data.table(bar)
bar[, update:=TRUE]
foo <- merge(foo, bar, by='index', all=T, suffixes=c("",".update"))
foo[!is.na(update),value:=value.update]
foo[, c("value.update","update"):=NULL]
foo

   index value
1:     a   100
2:     b   101
3:     c   200
4:     d   201

score 0 · Answer 7 · answered May 17 '23 at 14:55

I was facing a similar issue but this is still a specific case as compared to mine. I had 2 dataframes where left one was the master dataframe (like foo in your case) and the right one was minor dataframe with updated values (like bar in your case). Now I had to update the values in left dataframe from the ones in right dataframe only the values which were differing. It is somewhat similar to your issue but people have answered it with respect to NAs.

For more generic solution that is applicable to multiple rows and multiple columns with non NA values.

foo <- data.frame(index=c('a', 'b', 'c', 'd'), value=c(100, 101, NA, NA))
bar <- data.frame(index=c('c', 'd'), value=c(200, 201))

ModifiedIndexs <-
  foo %>% 
  full_join(bar) %>% 
  group_by(index) %>% 
  summarise(count = n()) %>% 
  filter(count > 1) %>% 
  pull(index)

UpdatedDF <- foo
UpdatedDF[which(UpdatedDF$index %in% ModifiedIndexs),] <- bar

Keep Coding!

Use merge() to update a data frame with values from a second data frame

7 Answers7

Linked

Related