R - merge and update primary dataset

Question

So, I have two dataset representing old and current address.

> main
idspace   id  x  y move
    198 1238 33  4 stay
    641 1236 36 12 move
    1515 1237 30 28 move

> move
idspace   id x y move
      4 1236 4 1 move

What I need is to merge the new data (move) with the old (main) and update main once merged.

I was wondering if it was possible to it in one operation ?

The update is based on id, wich is the personal identifier.

idspace, x, y are the location id.

So, the output I need is

> main
  idspace   id   x  y move
       198 1238 33  4 stay
       4   1236 4   1 move # this one is updated 
       1515 1237 30 28 move

I have no clue how I can do this.

Something like

merge(main, move, by = c('id'), all = T, suffixes = c('old', 'new'))

However, this is wrong because I need to do so many manipulations by hand.

Any solution ?

data

> dput(main)
structure(list(idspace = structure(c(2L, 3L, 1L), .Label = c("1515", 
"198", "641"), class = "factor"), id = structure(c(3L, 1L, 2L
), .Label = c("1236", "1237", "1238"), class = "factor"), x = structure(c(2L, 
3L, 1L), .Label = c("30", "33", "36"), class = "factor"), y = structure(c(3L, 
1L, 2L), .Label = c("12", "28", "4"), class = "factor"), move =     structure(c(2L, 
1L, 1L), .Label = c("move", "stay"), class = "factor")), .Names = c("idspace", 
"id", "x", "y", "move"), row.names = c(NA, -3L), class = "data.frame")

> dput(move)
structure(list(idspace = structure(1L, .Label = "4", class = "factor"), 
id = structure(1L, .Label = "1236", class = "factor"), x = structure(1L,  .Label = "4", class = "factor"), 
    y = structure(1L, .Label = "1", class = "factor"), move = structure(1L, .Label = "move", class = "factor")), .Names = c("idspace", 
"id", "x", "y", "move"), row.names = c(NA, -1L), class = "data.frame")`

I think it's a dup as `tmp <- rbind(move,main); tmp[!duplicated(tmp$id),]` logic works just fine, assuming there's no other requirements here. — thelatemail, Aug 22 '16 at 01:16
@thelatemail I was thinking of using `sqldf`, but I don't know that API well enough to answer. — Tim Biegeleisen, Aug 22 '16 at 01:17
@TimBiegeleisen - maybe `sqldf(" select coalesce(b.idspace,a.idspace) as idspace, coalesce(b.id,a.id) as id, coalesce(b.x,a.x) as x, coalesce(b.y,a.y) as y, coalesce(b.move,a.move) as move from main a left join move b on a.id = b.id ")` - ugly but it does work. — thelatemail, Aug 22 '16 at 01:30
it should exist an elegant `sql` for sure, I need to dig into it but thanks @TimBiegeleisen — giac, Aug 22 '16 at 01:38

score 9 · Accepted Answer · answered Aug 22 '16 at 01:22

9

Using the join+update feature of data.table:

require(data.table) # v1.9.6+
setDT(main) # convert data.frames to data.tables by reference
setDT(move)

main[move, on=c("id", "move"),  # extract the row number in 'main' where 'move' matches
       c("idspace", "x", "y") := .(i.idspace, i.x, i.y)] # update cols of 'main' with 
                                                         # values from 'i' = 'move' for 
                                                         # those matching rows


main
#    idspace   id  x  y move
# 1:     198 1238 33  4 stay
# 2:       4 1236  4  1 move
# 3:    1515 1237 30 28 move

This updates main in-place.

answered Aug 22 '16 at 01:22

Arun

116,683
26
284
387

1

this is great ! any `dplyr` routine per chance ? – giac Aug 22 '16 at 01:26
2

Asking the main data.table developer for a dplyr solution... Hmm... – nrussell Aug 22 '16 at 01:31
ok ok sorry ;) - still awesome solution ! – giac Aug 22 '16 at 01:39

score 1 · Answer 2 · answered Aug 22 '16 at 01:51

Here's a dplyr solution:

# If you want both old and new
dplyr::full_join(main, move)

# If you want both old and new with a suffix column
main$suffix <- "old"
move$suffix <- "new"
dplyr::full_join(main, move)

# If you want new only
new       <- dplyr::left_join(main,move,by="id") # could also use  %>%
main[!is.na(new$move.y),1]   <- new[!is.na(new$move.y),6]  
main[!is.na(new$move.y),3:4] <- new[!is.na(new$move.y),7:8]

score 1 · Answer 3 · answered Aug 22 '16 at 10:06

1

I think I found a very simple way to solve this problem with

main = as.matrix(main)
move = as.matrix(move)

main[main[,'id'] %in% move[,'id'], ] <- move

Which matches id, keeps the id ordered and only change the matched rows. It seems to work on the whole dataset.

answered Aug 22 '16 at 10:06

giac

4,261
5
30
59

Note that there's no way of knowing which `main$id` matches with which `move$id` in this case. You're assuming that the matches will be in the same order in which rows in `move` are. – Arun Aug 22 '16 at 17:43
@Arun you are absolutely right. However, it does seems to work. I also tried `main[main[,'id'] %in% move[,'id'], c('idspace', 'x', 'y','move')] <- move[which(move[,'id'] %in% main[,'id']), c('idspace', 'x', 'y','move')]` which also update fine. In the latter case, this `id` are matched. Thanks again for your patience and attention! – giac Aug 22 '16 at 22:13
1

`%in%` returns a logical vector. It always preserves the order of the input data upon subset. Try it on a more complex example. E.g., if the 1st and 3rd entries of `main$id` match to 3rd and 1st entries of `move$id`, you'd be assigning the 1st and 3rd rows of `move` to 1st and 3rd rows of `main`. That'd be wrong. – Arun Aug 22 '16 at 23:21
yes you are right. Just make sure as you say to keep the same order. – giac Aug 23 '16 at 15:42

R - merge and update primary dataset

3 Answers3