Does something like the 'paste_over' function below already exist within base R or one of the standard R packages?

paste_over <- function(original, corrected, key){
  corrected <- corrected[order(corrected[[key]]),]

  output <- original
  output[
    original[[key]] %in% corrected[[key]],
    names(corrected)
    ] <- corrected

  return(output)
}

An example:

D1 <- data.frame(
  k = 1:5,
  A = runif(5),
  B = runif(5),
  C = runif(5),
  D = runif(5),
  E = runif(5)
  )

D2 <- data.frame(
  k=c(4,1,3),
  D=runif(3),
  E=runif(3),
  A=runif(3)
  )

D2 <- D2[order(D2$k),]

D3 <- D1
D3[
  D1$k %in% D2$k,
  names(D2)
  ] <- D2


D4 <- paste_over(D1, D2, "k")

all(D4==D3)

In the example, D2 contains some values that I want to paste over the corresponding cells within D1. However, D2 is not in the same order as D1 and does not have the same dimensions.

The motivation for this is that I was given a very large dataset, reported some errors within it, and received a subset of the original dataset with some corrected values. I would like to be able to 'paste over' the new, corrected values into the old dataset without changing the old dataset's structure (as the rest of the code I've written assumes that structure).

Although the paste_over function seems to work, I can't help but think this must have been tackled before, so maybe there's already a well-known function that is both faster and has error checking. If there is, please let me know what it is. Thanks.

talat
JonMinton

2 Answers

We can accomplish this using data.table as follows:

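library(data.table)

setkeyv(setDT(D1), "k")        # convert D1 by reference and key it on k
cols = c("D", "E", "A")        # columns to overwrite
D1[D2, (cols) := D2[, cols]]   # update the matching rows of D1 in place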
  • setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table.

  • setkeyv() (the variant of setkey() that takes the column names as a character vector) sorts the data.table by the column specified (here k) and marks that column as sorted (by setting the attribute sorted) by reference. This allows us to perform joins using binary search.

  • x[i] in data.table performs a join. You can read more about it here. Briefly, for each row of column k in D2, it finds the matching row indices in D1 by matching on D1's key column (here k).

  • x[i, LHS := RHS] performs the join to find matching rows, and the LHS := RHS part adds/updates x with the columns specified in LHS with the values specified in RHS by reference. LHS should be a vector of column names or numbers, and RHS should be a list of values.

    So, D1[D2, (cols) := D2[, cols]] finds matching rows in D1 for k=c(1,3,4) from D2 and updates the columns D,E,A specified in cols by the list (a data.frame is also a list) of corresponding columns from D2 on RHS.

D1 will now be modified in-place.
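
Since setkeyv() physically re-sorts D1, the keyed approach changes the row order whenever D1 is not already sorted by k. If the row order has to stay untouched, one possible variant is an ad hoc join with the on= argument, which needs no key. The sketch below is only an illustration: the small D1 and D2, the seed, and the helper object `before` are assumptions made up for the example, and the i. prefix is used to refer to D2's columns inside the join.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the run is reproducible

# Illustrative data, smaller than the tables in the question
D1 <- data.table(k = 1:5, A = runif(5), B = runif(5), D = runif(5))
D2 <- data.table(k = c(4L, 1L, 3L), D = runif(3), A = runif(3))

before <- copy(D1)               # untouched copy, kept only for comparison
cols <- setdiff(names(D2), "k")  # every corrected column except the key

# Join on k without setting a key, so D1 keeps its row order.
# i.<name> refers to the column <name> of D2 (the `i` table).
D1[D2, on = "k", (cols) := mget(paste0("i.", cols))]
```

Because := still updates by reference, taking a copy() first (as done for `before`) is one way to keep the uncorrected table around.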

HTH

Community
Arun

You could use the replacement method for data frames in your function, perhaps like this. It does adequate checking for you. I chose to pass the logical row subset as an argument, but you can change that.

pasteOver <- function(original, corrected, rows) {
    # rows: logical vector selecting the rows of original to overwrite
    `[<-.data.frame`(original, rows, names(corrected), corrected)
}

(p1 <- pasteOver(D1, D2, D1$k %in% D2$k))
  k          A           B         C         D          E
1 1 0.18827167 0.006275082 0.3754535 0.8690591 0.73774065
2 2 0.54335829 0.122160101 0.6213813 0.9931259 0.38941407
3 3 0.62946977 0.323090601 0.4464805 0.5069766 0.41443988
4 4 0.66155954 0.201218532 0.1345516 0.2990733 0.05296677
5 5 0.09400961 0.087096652 0.2327039 0.7268058 0.63687025

p2 <- paste_over(D1, D2, "k")
identical(p1, p2)
# [1] TRUE
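
Both functions above select rows with a logical vector, so they rely on the corrected rows arriving in the same order as the matching rows of the original. A rough sketch that drops this assumption by looking up row positions with match(), and adds a few basic checks, might look like the following. The name `paste_over_checked`, the particular checks, and the small example data are all hypothetical choices for illustration, not an existing base R function.

```r
# Hypothetical helper: name and checks are illustrative only
paste_over_checked <- function(original, corrected, key) {
  stopifnot(
    key %in% names(original),
    key %in% names(corrected),
    all(names(corrected) %in% names(original)),
    !anyDuplicated(original[[key]]),
    !anyDuplicated(corrected[[key]])
  )

  # Row of `original` that each row of `corrected` refers to
  idx <- match(corrected[[key]], original[[key]])
  if (anyNA(idx)) {
    stop("some keys in 'corrected' are not present in 'original'")
  }

  # Overwrite only the non-key columns, row by matched row
  cols <- setdiff(names(corrected), key)
  original[idx, cols] <- corrected[cols]
  original
}

set.seed(42)  # assumption: fixed seed so the run is reproducible

# Illustrative data: keys deliberately unsorted in both tables
D1 <- data.frame(k = c(3, 1, 5, 2, 4), A = runif(5), B = runif(5), D = runif(5))
D2 <- data.frame(k = c(4, 1, 3), D = runif(3), A = runif(3))

D5 <- paste_over_checked(D1, D2, "k")
```

Neither table needs to be sorted here, and the key column itself is left alone, so its type in the original is preserved.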
Rich Scriven