Merge with replacement based on multiple non-unique columns

Question

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).

I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.

Unfortunately I don't have a unique id column to help me match them. I have an x column and a y column which, combined, can make up a unique id.

My question is this: What is an elegant way of merging these two data sets, replacing the values in the original data frame with the values from the "differenced" data frame wherever both the x and y coordinates match?

Here's some example data to illustrate:

original <- data.frame(x = 1:10, y = 23:32, value = 120:129)

    x  y value
1   1 23   120
2   2 24   121
3   3 25   122
4   4 26   123
5   5 27   124
6   6 28   125
7   7 29   126
8   8 30   127
9   9 31   128
10 10 32   129

And the dataframe with updated differences:

update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

  x  y value
1 1  2    50
2 2 24    51
3 3 17    52
4 4 23    53
5 8 30    54

The desired final output should contain all the rows in the original data frame. However, the rows in original whose x and y coordinates both match the corresponding coordinates in update should have their value replaced with the value from the update data frame. Here's the desired output:

original_updated <- data.frame(x = 1:10, y = 23:32, 
                               value = c(120, 51, 122:126, 54, 128:129))

    x  y value
1   1 23   120
2   2 24    51
3   3 25   122
4   4 26   123
5   5 27   124
6   6 28   125
7   7 29   126
8   8 30    54
9   9 31   128
10 10 32   129

I've been trying to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But neither of the two columns is unique on its own.

One solution would be to treat them as strings or tuples and combine them into one column as a coordinate pair, and then use %in%.

But I was curious whether there is a solution to this problem involving indexing with boolean vectors. Any suggestions?

With data.table, this is as simple as `original[update, on=.(x,y), value := i.value ]` (once they are both data.tables). A similar question: http://stackoverflow.com/q/42587214/ The vignettes on the data.table website will get you started if you are new to the package. — Frank, Apr 04 '17 at 05:20

score 3 · Accepted Answer · answered Apr 04 '17 at 02:50

First merge in a way which guarantees all values from the original will be present:

merged = merge(original, update, by = c("x","y"), all.x = TRUE)

Then use dplyr to choose update's values where possible, and original's value otherwise:

library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
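
For readers who would rather not load dplyr, the same two steps can be sketched in base R. This is only a variation on the answer above, not a different method; it relies on the default `.x`/`.y` suffixes that `merge` gives to the clashing `value` columns, and it reuses the example data from the question.

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# Left join on both coordinates; rows of `original` without a match
# in `update` get NA in value.y.
merged <- merge(original, update, by = c("x", "y"), all.x = TRUE)

# Prefer the updated value where one exists, otherwise keep the original.
final <- data.frame(
  x = merged$x,
  y = merged$y,
  value = ifelse(is.na(merged$value.y), merged$value.x, merged$value.y)
)
```

Note that `merge` sorts the result by the key columns by default, so the row order of `final` only matches `original` here because `original` is already sorted by x.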

lebelinoz


Thanks. You can also `full_join(original, update, by = c("x", "y"))` if you want to keep it all dplyr. I think this answer is the most practical (along with Frank's comment) so I've accepted it! The match solution by **42-** stayed true to what I was asking for though. – Lauler Apr 04 '17 at 09:56

IRTFM · Answer 2 · 2017-04-04T04:29:11.487

The match function is used to generate indices. It needs the nomatch argument to prevent NA values on the left-hand side of `[<-.data.frame`. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster:

original[  match(update$x, original$x)[
                                       match(update$x, original$x, nomatch=0) == 
                                       match(update$y, original$y,nomatch=0)]   ,
          "value"] <- 
  update[ which( match(update$x, original$x) == match(update$y, original$y)), 
           "value"]

You can see the difference that nomatch makes:

> match(update$x, original$x)[
            match(update$x, original$x) == 
                match(update$y, original$y) ]
[1] NA  2 NA  8
> match(update$x, original$x)[
            match(update$x, original$x, nomatch=0) == 
                match(update$y, original$y,nomatch=0)]
[1] 2 8

The "interior" match functions are returning:

> match(update$y, original$y)
[1] NA  2 NA  1  8
> match(update$x, original$x)
[1] 1 2 3 4 8
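
For comparison, the composite-key idea raised in the question can be sketched with a single `match` call. This is a hedged sketch rather than part of the answer above: the names `key_original`, `key_update`, `idx` and `hit` are made up for illustration, and it assumes each (x, y) pair occurs at most once in `original`. Because the two coordinates are compared together, it keeps working when individual x or y values repeat across rows.

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# Build one key per row from the coordinate pair.
key_original <- paste(original$x, original$y, sep = "_")
key_update <- paste(update$x, update$y, sep = "_")

# Row of `original` matched by each row of `update`; NA where there is none.
idx <- match(key_update, key_original)

# Boolean vector marking the rows of `update` that found a match.
hit <- !is.na(idx)

# Overwrite only the matched rows.
original$value[idx[hit]] <- update$value[hit]
```

The separator only needs to be a string that cannot appear inside a coordinate, so that two different pairs can never produce the same key.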
