14

Given 2 data frames that are identical in terms of column names/datatypes, where some columns uniquely identify the rows, is there an efficient function/method for one data.frame to "update" the other?

For example, in the following, original and replacement are identified by 'Name' and 'Id'. goal is the result of finding all rows from replacement in original (by the unique id's) and replacing with Value1 and Value2

original = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,NA), Value2 = c(NA,9.2) )
replacement = data.frame( Name = c("john") , Id = 2 , Value1 = 2.2 , value2 = 5.9)
goal = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,2.2), Value2 = c(NA,5.9) )

The solution should work for an original and replacement of arbitrary length (although replacement should never have more rows than original). In practice, I'm using 2 id columns.

Arun
  • 116,683
  • 26
  • 284
  • 387
Suraj
  • 35,905
  • 47
  • 139
  • 250

7 Answers7

11

I'd use data.table objects. This code seems to work on your example:

library(data.table)

# set keys
original.dt <- data.table(original, key=c("Name", "Id"))        
replacement.dt <- data.table(replacement, key=c("Name", "Id"))

goal2 <- original.dt
# subset and reassign
# goal2[replacement.dt[, list(Name, Id)]] <- replacement.dt
goal2[replacement.dt] <- replacement.dt  # cleaner and faster, see Matthew's comment

goal2 <- as.data.frame(goal2)

identical(goal, goal2) # FALSE, why? See Joris's comment
all.equal(goal, goal2) # TRUE
Ryogi
  • 5,497
  • 5
  • 26
  • 46
  • thanks! I've avoided a dependency on data.table thus far and would prefer a solution using the base classes. Will wait for other replies. – Suraj Nov 01 '11 at 19:37
  • 1
    +1 for the data.table. As to why identical returns FALSE: in goal2, Id is an int vector, whereas in goal this is a num vector. – Joris Meys Nov 01 '11 at 19:41
  • @JorisMeys ... as `str()` reveals. Thanks! – Ryogi Nov 01 '11 at 19:43
  • 3
    Small improvement : can drop the `[,list(Name,Id)]` bit. Just `goal2[replacement.dt]<-replacement.dt` is shorter and faster because `replacement.dt`'s key is already `"Name,Id"`. `i` doesn't have to be keyed, but when it is, it's `i`'s key columns that are joined and a faster merge algorithm is used internally that takes advantage of the fact that both tables are sorted. – Matt Dowle Nov 02 '11 at 09:22
7

Just set a unique ID as the row names. Then it is simple indexing:

rownames(original) = original$Id
rownames(replacement) = replacement$Id

original[rownames(replacement), ] = replacement
John Colby
  • 22,169
  • 4
  • 57
  • 69
  • what if there will be rows with equal Id? for ex. you can't do rownames(original) <- c(1,1,1) – Max Nov 01 '11 at 19:51
  • They just have to be unique. For example paste(idvar1, idvar2, ...) – John Colby Nov 01 '11 at 19:58
  • my fault, arrr. if there are several places in original, and replacement with one Id, than it is unclear from where to where should be replaced – Max Nov 01 '11 at 20:09
  • but there are two identifiers here - Id and Name. I don't think this approach works for non-integer identifiers – Suraj Nov 01 '11 at 20:09
  • @SFun28, why this shouldn't? rownames are strings – Max Nov 01 '11 at 20:12
  • It seems that in this example rownames(replacement) returns strings which are coerced to integers which allows original to be subsetted? What if the identifier is Name? what does original[ "Joe" , ] do? – Suraj Nov 01 '11 at 20:14
  • @SFun28 You *can* do the indexing by character. Check out this example: https://gist.github.com/1331778 – John Colby Nov 01 '11 at 20:20
  • @Max Yea definitely not the most robust like you point out, but I like it when I can get away with using it because it's simple and fast. – John Colby Nov 01 '11 at 20:21
  • +1 That does in fact work. cool. R continues to amaze (and slightly scare) me every day. =) – Suraj Nov 01 '11 at 20:24
  • Yea! It really comes in handy. Good luck! – John Colby Nov 01 '11 at 20:25
6

Using base R, you can use the function replace.df() below, which is loosely based on the source code of merge.data.frame(). Contrary to some other solutions, this one allows multiple columns for identification. I use it rather often in my job. Feel free to copy and use.

This function controls for cases where rows in y are not found in x. Mind that the function does not check whether the combinations are unique. match() will only replace the first occurence by the first occurence of a combination.

The function is used as follows :

> replace.df(original, replacement,by=c('Name','Id'))
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2    2.2    9.2

Note that this effectively detects the writing error you have in your original code. replacement contains a variabe named 'value2' (small v) instead of Value2 (capital V). After correcting this, the result becomes:

> replace.df(original, replacement,by=c('Name','Id'))
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2    2.2    5.9

You can use that function as well for changing the values in only some of the columns

> replace.df(original, replacement,by=c('Name','Id'),cols='Value2')
  Name Id Value1 Value2
1  joe  1    1.2     NA
2 john  2     NA    5.9

The function:

replace.df <- function(x,y,by,cols=NULL
           ){
    nx <- nrow(x)
    ny <- nrow(y)

    bx <- x[,by,drop=FALSE]
    by <- y[,by,drop=FALSE]
    bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))

    bx <- bz[seq_len(nx)]
    by <- bz[nx + seq_len(ny)]

    idx <- match(by,bx)
    idy <- match(bx,by)
    idy <- idy[!is.na(idy)]

    if(is.null(cols)) {
      cols <- intersect(names(x),names(y))
      cols <- cols[!cols %in% by]
    }

    x[idx,cols] <- y[idy,cols]
    x
  }
Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • sorry, I meant to say "you don't HAPPEN". fixed the post. – Suraj Nov 01 '11 at 20:37
  • @SFun28 : ah no, but I maybe should start doing that stuff. I also corrected a typo (the drop=FALSE in the last lines was obviously erroneous. I'm not that good in reproducing functions from the top of my head...) – Joris Meys Nov 01 '11 at 20:40
  • @TylerRinker : Did you see my correction? The drop=FALSE in the second last line is rubbish and should be removed. It is not an argument for the assignment function. I typed the function from the top of my head, the original is at work. Hence the mistake. – Joris Meys Nov 01 '11 at 21:20
  • @Joris Now it works with everything except: `cols='Value2'` which throws up the warning `Error in [<-.data.frame(*tmp, idx, cols, value = NULL) : replacement has length zero` – Tyler Rinker Nov 01 '11 at 22:15
2

Here is an approach using the digest package.

library(digest)
# generate keys for each row using the md5 checksum based on first two columns
check1 <- apply(original[,1:2], 1, digest)
check2 <- apply(replacement[,1:2], 1, digest)

# set goal to original and replace rows in replacement
goal <- original
goal[check1 %in% check2,] <- replacement
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • This is a good idea, but it suffers in terms of performance. You need to calculate the digest which would be slower than simply concatenating the identifiers. – Suraj Nov 01 '11 at 20:12
1

I produced a function that uses the method of indexing (see answer by John Colby above). Hopefully it can be useful for all such needs of updating one data frame with the values from another data frame.

update.df.with.df <- function(original, replacement, key, value) 
{
    ## PURPOSE: Update a data frame with the values in another data frame
    ## ----------------------------------------------------------------------
    ## ARGUMENT:
    ##   original: a data frame to update,
    ##   replacement: a data frame that has the updated values,
    ##   key: a character vector of variable names to form the unique key
    ##   value: a character vector of variable names to form the values that need to be updated
    ## ----------------------------------------------------------------------
    ## RETURN: The updated data frame from the old data frame "original". 
    ## ----------------------------------------------------------------------
    ## AUTHOR: Feiming Chen,  Date:  2 Dec 2015, 15:08

    n1 <- rownames(original) <- apply(original[, key, drop=F], 1, paste, collapse=".")
    n2 <- rownames(replacement) <- apply(replacement[, key, drop=F], 1, paste, collapse=".")

    n3 <- merge(data.frame(n=n1), data.frame(n=n2))[[1]] # make common keys
    n4 <- levels(n3)[n3]                # convert factor to character

    original[n4, value] <- replacement[n4, value] # update values on the common keys
    original
}
if (F) {                                # Unit Test 
    original <- data.frame(x=c(1, 2, 3), y=c(10, 20, 30))
    replacement <- data.frame(x=2, y=25)
    update.df.with.df(original, replacement, key="x", value="y") # data.frame(x=c(1, 2, 3), y=c(10, 25, 30))

    original <- data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 20, 30))
    replacement <- data.frame(x=2, w="b", y=25)
    update.df.with.df(original, replacement, key=c("x", "w"), value="y") # data.frame(x=c(1, 2, 3), w=c("a", "b", "c"), y=c(10, 25, 30))

    original = data.frame(Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,NA), Value2 = c(NA,9.2))
    replacement = data.frame(Name = c("john") , Id = 2 , Value1 = 2.2 , Value2 = 5.9)
    update.df.with.df(original, replacement, key="Id", value=c("Value1", "Value2"))
    ## goal = data.frame( Name = c("joe","john") , Id = c( 1 , 2) , Value1 = c(1.2,2.2), Value2 = c(NA,5.9) )
}
Feiming Chen
  • 69
  • 1
  • 3
1
# limit replacement to elements that have a correspondence in original 
existing = replacement[is.element(replacement$Id, original$Id),]
# replace original at positions where IDs from existing match   
original[match(existing$Id,original$Id),]=existing
themel
  • 8,825
  • 2
  • 32
  • 31
  • Eh, the is.element generalizes easily (just ANDing the index arrays for both), but I can't see a straightforward way for the match part. – themel Nov 02 '11 at 01:14
1
require(plyr)
indexes_to_replace <- rownames(match_df(original,replacement,on='Id'))
indexes_from_replace<-rownames(match_df(replacement,original,on='Id'))
original[indexes_to_replace,] <- replacement[indexes_from_replace,]

param on of function match_df can take vectors as well.

Max
  • 4,792
  • 4
  • 29
  • 32
  • This is similar to John Colby's answer, but I think a little slower because of the call to match_df? – Suraj Nov 01 '11 at 20:28