
I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.

The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
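Stated as code for a single pair of aligned value vectors (update_values and base_values are just placeholder names for illustration):

merged <- ifelse(is.na(update_values), base_values, update_values)  # keep the base value wherever the update is NA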

Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.

So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:

  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1

...and the update data frame "df2":

  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2

The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
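For example, with df1/df2 as above, an outer merge keeps both copies of every data column:

merged <- merge(df1, df2, by="K", all=TRUE)
names(merged)
# [1] "K"    "P1.x" "P2.x" "P3.x" "P1.y" "P2.y" "P3.y"

...and I would then have to walk those .x/.y pairs to collapse them.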

I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
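For illustration (not my exact code), the cell-at-a-time loop I'm describing is shaped roughly like this:

m <- as.matrix(df1[-1])                   # base values, pre-populated
rownames(m) <- df1$K
for (k in intersect(df2$K, df1$K)) {      # overlapping rows only
    for (p in names(df2)[-1]) {
        v <- df2[df2$K == k, p]
        if (!is.na(v)) m[k, p] <- v       # one cell at a time: very slow
    }
}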

I'm sure I'm missing something, but can't find a comparable example.

I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.

Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
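(Finding and appending those rows is the easy part; updated_df1 below is just a placeholder for df1 after the overwrites:)

new_rows <- df2[!(df2$K %in% df1$K), ]    # keys in df2 but not in df1
result <- rbind(updated_df1, new_rows)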

Here is some code to fabricate the data frames above:

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);

Thanks

gkaupas
  • Possible duplicate: http://stackoverflow.com/questions/9918450/join-matching-columns-in-a-data-frame-or-data-table/9918769#9918769 – Tyler Rinker Apr 04 '12 at 16:52
  • I take it back; this is not a duplicate. I didn't read carefully enough. You want a merge that replaces the NAs of one df with values from another df. A bit more complex. – Tyler Rinker Apr 04 '12 at 17:37
  • In `data.table` one way would be to flatten both `df1` and `df2` to 3 columns: (K,P,val) each with a 2-column key (K,P). Then `df1[df2,val:=df2.val]` and unflatten afterwards. Or, keeping the same structure you have, in a loop through `df2` do `df1[k,p:=value,with=FALSE]` which will be fast because loops on `data.table`s are much faster. If you like the loop approach then `set()` is even faster than `:=`. – Matt Dowle Apr 10 '12 at 12:05
  • @MatthewDowle The normalized (flattened) route with `df1[df2,val:=df2.val]` gives `Error in := (val, df2.val) : := is defined for use in j only; i.e., DT[i,col:=1L] not DT[i,col]:=1L or DT[i]$col:=1L.`. – gkaupas Apr 12 '12 at 18:25
  • `df1` needs to be a `data.table`; e.g. `df1=as.data.table(df1)`. I'll add something to that error message to suggest checking that type. – Matt Dowle Apr 12 '12 at 20:47
  • `dt1 <- data.table(read.table(text="K1 K2 V\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE),key=c("K1","K2"));` then `dt2 <- data.table(read.table(text="K1 K2 V\nB P1 2\nB P2 1\nC P1 2\nC P2 2", header=TRUE),key=c("K1","K2"));` then `dt1[dt2,V:=dt2.V]` yields error `combining bywithoutby with := in j is not yet implemented`. – gkaupas Apr 13 '12 at 13:50
  • Three part comment, part one: Take Tyler's `decider` function below. Fabricate two very small example normalized tables. Apply the function. Result is correct. 1: `decider <- function(x, y) ifelse(is.na(x), y, x);` 2: `dt1 <- data.table(read.table(text="KEY1 KEY2 OLDVAL\nA P1 1\nA P2 1\nB P1 1\nB P2 1", header=TRUE),key=c("KEY1","KEY2"));` 3: `dt2 <- data.table(read.table(text="KEY1 KEY2 NEWVAL\nA P1 2\nA P2 NA\nB P1 2\nB P2 2", header=TRUE),key=c("KEY1","KEY2"));` 4: `dt1[dt2,decider(NEWVAL,OLDVAL)]` – gkaupas Apr 13 '12 at 14:50
  • Second part: Now use data.table example to generate monster table. With `grpsize = ceiling(6e5/26^2);` and `dt1<-data.table(data.frame(KEY1=rep(factor(LETTERS),each=26*grpsize),KEY2=rep(factor(letters),each=grpsize),OLDVAL=runif(grpsize*26^2)),key=c("KEY1","KEY2"));` and repeat for dt2 (change OLDVAL to NEWVAL), then `dt1[dt2,decider(NEWVAL,OLDVAL)]` and `gc` before/after I see memory peaks at 8 Gb for two 10 Mb tables... far too costly. If I change grpsize to use `7e5`, the `[` fails with `cannot allocate vector of size 2.7 Gb`. – gkaupas Apr 13 '12 at 14:51
  • Third part: Now use `grpsize = ceiling(1e7/26^2);` as shown in data.table documentation. The `dt1[dt2,decider(NEWVAL,OLDVAL)]` fails with `Error in rep(1L, sum(lengths)) : invalid 'times' argument In addition: Warning message: In sum(lengths) : Integer overflow - use sum(as.numeric(.))`. A normalized dataset of 1e7 rows is "small" for my application. It seems I have to write a layer of partitioning code and spoon-feed the datasets to R in smaller pieces. It is not obvious to me how packages like `bigmemory` or `ff` help as there is no mention of actively paging data to/from disk on demand. – gkaupas Apr 13 '12 at 14:52
  • Please confirm if the approach in my answer works or not, and scales. Then at least I understand the problem and we can look at a more elegant solution after that ... – Matt Dowle Apr 15 '12 at 17:04

4 Answers


This is likely not the fastest solution but is done entirely in base.

(updated answer per Tommy's comments)

#READING IN YOUR DATA FRAMES
df1 <- read.table(text="  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1", header=TRUE)

df2 <- read.table(text="  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2", header=TRUE)

all <- c(levels(df1$K), levels(df2$K))                  #all cells of key column
dups <- all[duplicated(all)]                            #the overlapping key cells
ndups <- all[!all %in% dups]                            #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows

decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ], 
    FUN = decider)) #replace all NAs of df2 with df1 values if they exist

df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with the NA-replaced df4
df5 <- df5[order(df5$K), ]  #reorder based on key column
rownames(df5) <- 1:nrow(df5)  #give proper non duplicated rownames
df5

This yields:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:

all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
        colnames(LS[[i]]) <- colnames(LS[[2]])
        return(LS[[i]])
    }
)

LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])

decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
Tyler Rinker
  • ...so how does this handle the key column matching? And it seems all numeric values turn into factors... – Tommy Apr 04 '12 at 16:54
  • @Tommy I updated to reflect your factor comment. That was because I just cut and pasted from the last time I gave a similar answer. – Tyler Rinker Apr 04 '12 at 17:13
  • @Tommy I get the key column matching now as well. Didn't catch that before. back to the drawing board. I thought the 2 dataframes were the same observations. – Tyler Rinker Apr 04 '12 at 17:15
  • @TylerRinker It is OK to assume the columns will have the same names and be in the same order; I have some quick pre-processing that ensures that. – gkaupas Apr 04 '12 at 18:11
  • If all files have the same names then my first approach is likely easier. Not sure how it will fare with speed. Once you test it let me know. – Tyler Rinker Apr 04 '12 at 18:12
  • @TylerRinker If I use the first approach but with my files/read.tables, then my data frames don't have factors, so the "all" object is NULL. If I coerce the K columns to be factors with `df1$K <- as.factor(df1$K)`, then the rbind borks with `invalid factor level, NAs generated`. The second approach works; I am trying with a large dataset. – gkaupas Apr 04 '12 at 18:24
  • @gkaupas I addressed this in the second approach but not the first. So this approach works for you then? How is the speed on 10,000 columns? – Tyler Rinker Apr 04 '12 at 18:26
  • @TylerRinker: 200 seconds to read in the files, 93 seconds through creation of LS2, 35 more seconds through LS4, 886 seconds for the mapply, 51 seconds after that, plus 57 seconds to write.csv the result; 22 minutes total, about 3X the time SAS requires. Much better than my for-loops for sure. – gkaupas Apr 04 '12 at 19:37

This loops by column, setting dt1 by reference and (hopefully) should be quick.

require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
    nna = !is.na(dt2[[i]])
    set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
     K P1 P2 P3
[1,] A  1  1  1
[2,] B  2  1  2
[3,] C  1  2  2
[4,] D  2  2  2
Matt Dowle
  • Now we're cooking. I am getting "merge" rates of 3,200,000 cells per second for 1000-row datasets increasing to 7,200,000 cells per second for 10,000-row datasets, with the rates seemingly independent of the number of columns in the dataset. **However** the code seems to behave _only if the values are numeric_. My test case is all columns are character populated with A's and B's read in with `data.table(read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE), key=c("K"))` with the result being all of `dt1` apparently replaced by all of `dt2`. What might cause the distinction? – gkaupas Apr 19 '12 at 19:11
  • Phew, sounds promising. There's no reason why character shouldn't work the same as numeric, afaik. Will need an example please showing what you get vs what you expect. Wild guess that it's the difference between `NA_character_` and `"NA"`? – Matt Dowle Apr 19 '12 at 23:22
  • My fault, your guess is indeed the cause; in my read.table I forgot to specify `na.strings=""`. Once that was added, your code works fine. Example [here](http://pastebin.com/Wr90cBuU) where `dt2` contains 2 updates, first record overwrites same in `dt1`, second has both columns NA (one is numeric, one is character), and it correctly has no effect on matching record in `dt1`. I will re-run my benchmarks where `dt2` has random NA's in it and `read.table` has correct options, but this solution is clearly the most effective one. Thank you for your diligence. – gkaupas Apr 20 '12 at 16:54
  • Relief. Thanks for your patience, got there in the end. It's been useful for me too as it turns out. – Matt Dowle Apr 20 '12 at 19:31
  • Epilogue: Despite the 3 million+ cell/second processing, on my hardware, unfortunately my overall throughput is gated by reading/writing the data. Whether I read/write CSV files, or load/save uncompressed .RData files, my effective processing rate drops to 250K cells/second on the server I used for the R benchmarking, or around 120K cells/second on the hardware running SAS. These are numbers for just read/write or load/save, no merging code in between. SAS reads/merges/writes at a net 600K cells/second. Foo. – gkaupas Apr 23 '12 at 18:23
  • I don't quite follow. 250k/sec (R) is faster than 120k/sec (SAS). Is the time R takes to load data missing? You have taken into account network latency, and isolated that by testing local files? However R's file reader is very slow. There are some settings to try (search SO). Sqldf has a fast file reader. Data.table has the bare bones of a fast file reader (to be completed). – Matt Dowle Apr 23 '12 at 19:59
  • Full SAS script (read, merge, write) on "the SAS server" runs 600K cells/second. Full R script (read, data.table merge, write) on a faster server runs 250K cells/second. Simple R script (read, write) in R 2.7 on "the SAS server" runs 120K cells/second. In all cases the files are on the same NFS-mounted filesystem (yeah, not the best). All 3 (NFS filer, SAS server, R 2.15 server) are in the same data center. Testing plain R I/O on "the SAS server" was to get closer to comparing apples to apples, and base R takes 5X the time SAS does. I'll revisit sqldf. – gkaupas Apr 24 '12 at 16:21
  • Well, sqldf is reading at around 500K cells/second. But now I have to deal with [the NA problem](http://code.google.com/p/sqldf/#14._How_does_one_read_files_where_numeric_NAs_are_represented_as), where the hack is to doctor the incoming data with placeholder values, then clean up those values afterwards. That cleanup process, at least using the suggested `is.na(DF) <- DF == -999` (whether DF's or DT's) is killing me with a 16K cell/second fix rate. Is there a better way to write that fix using `data.table` kung fu? – gkaupas Apr 25 '12 at 15:51
  • There is. See [this answer](http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table) (go straight to the EDIT at the end) and [this thread](http://r.789695.n4.nabble.com/Column-wise-value-replacement-tp4564391p4564391.html) for more info. And glad sqldf worked out. – Matt Dowle Apr 25 '12 at 16:16
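For reference, a sketch of the column-wise `set()` idiom the linked answer describes, applied to the -999 placeholder cleanup from the comments above (assumes `DT` is a data.table):

for (j in seq_along(DT))
    set(DT, which(DT[[j]] == -999), j, NA)   # replace placeholders with NA by reference, one column at a time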

EDIT: Please ignore this answer. It was a bad idea to loop by row; it works but is very slow. Left for posterity! See my 2nd attempt as a separate answer.

require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
    k = K[i]
    p = unlist(dt2[i,-1,with=FALSE])
    p = p[!is.na(p)]
    dt1[J(k),names(p):=as.list(p),with=FALSE]
}

Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
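A minimal sketch of that A[B] idea (assuming all-numeric columns, and that df2 has already been restricted to keys present in df1; new keys would be rbind-ed in afterwards):

common <- df2[df2$K %in% df1$K, ]              # overlapping update rows only
A <- as.matrix(df1[-1])                        # base values
upd <- as.matrix(common[-1])                   # update values, same column order
B <- which(!is.na(upd), arr.ind=TRUE)          # (row, col) of every non-NA update cell
B[, 1] <- match(common$K, df1$K)[B[, 1]]       # map to row numbers of A
A[B] <- upd[!is.na(upd)]                       # single vectorized assignment
df1[-1] <- A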

Matt Dowle
  • This fails with `Error in [.data.table(p, !is.na(p)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14).` Re: the `A[B]` syntax, if I need to construct `B` with update instructions, then I have to iterate through all the cells to determine that, which defeats the purpose, unless I'm misunderstanding. – gkaupas Apr 12 '12 at 18:20
  • @gkaupas Apols, meant `unlist` rather than `as.vector`. Have edited answer. On `A[B]` you can very likely vectorize the creation of `B`. If `matrix` is ok for your case (all columns integer or all columns numeric, and n*m < 2^31) then `matrix` and `A[B]` seems more appropriate and should be fast. – Matt Dowle Apr 12 '12 at 21:10
  • This moves the error down to the last line in the loop; something of the form `Supplied n items to be assigned to 1 items of column 'P1' (n-1 unused)`. My data will be a mix of strings and numbers so I guess that means no `matrix` route here. – gkaupas Apr 13 '12 at 13:56
  • @gkaupas Apols again, will edit answer to change `names(p):=p` to `names(p):=as.list(p)`. Ok, yes that means no `matrix` route. Must be an easier way in data.table (perhaps by wrapping it up in some generic function that does that task). – Matt Dowle Apr 13 '12 at 15:22
  • Well the `dt1[dt2,decider(NEWVAL,OLDVAL)]` I posted in the 3-part comment at the top, based on your input and Tyler's, is tantalizingly simple, even if it requires me to normalize my data first, and only handles a subset of the rows (i.e. I can either get back all rows of dt1 or all rows of dt2 and have to rbind in the rest). Unfortunately it just doesn't scale to my data volume. – gkaupas Apr 13 '12 at 18:15
  • But does this answer work now ok? And does it scale for your data volume? I'm not following the comments to the question at the top - could you draw it together into one code block that I could paste into a fresh R session? – Matt Dowle Apr 15 '12 at 16:56
  • Well it runs, but I don't see what it does. Both dt1 and dt2 appear unchanged afterwards. Do I need to add code inside the for loop? IIRC, when I put print()'s around the statements to see what they returned, the last statement in the for loop just returned the entire dt2 object each time; I don't see how that helps. As for the comments at the top: the 600-byte comment length limit made it prohibitive to repeat the long code over and over. – gkaupas Apr 16 '12 at 12:49
  • `:=` updates by reference. See `?":="`. Are you sure that `dt2` (not `df2`) has not been updated? – Matt Dowle Apr 17 '12 at 09:09
  • I see: if I change the last line inside the loop from `dt2[` to `dt1[`, then `dt1` does indeed change as desired. Otherwise apparently dt2 just overwrites itself. I will try this with some larger datasets this afternoon. I will still need to rbind in the new row(s) from dt2, but I already conceded to having to do that in my original post. Thanks – gkaupas Apr 17 '12 at 14:38
  • I ran 32 cases, with datasets ranging from 1,000,000 cells to 100,000,000 cells. The "merge" rate of this `data.table` loop was a pretty consistent 32,300 cells/second. Previously on the same machine, using `mapply` on data frames, I averaged 147,800 cells/second. I get 600,200 cells/second with SAS on ancient 1.35 GHz RISC CPUs (vs the 2.0 GHz Nehalem's I'm using for R). `gc()` says memory maxes at 4 Gb when combining the 2 100,000,000 cell data.tables, and the server has 8 Gb available and nothing else running. – gkaupas Apr 18 '12 at 17:13
  • Something is wrong then. If you post your test code so we can reproduce your timings then I (at least) will try to improve it. – Matt Dowle Apr 18 '12 at 20:23
  • You can see my Perl script [here](http://pastebin.com/hiPSunrN). It takes two (optional) arguments, the number of rows and columns to generate. It creates two input files with the first record containing column headings starting with K, then P000000, P000001, so on. Then rows of data with key values K000000 incrementing upward. The first file's cells are all A's. The second all B's. It generates an R script and has a system call to invoke it. The R script just reads in the two files, runs your loop, writes the result out as a CSV, reports the timings. – gkaupas Apr 18 '12 at 22:53
  • Wow. Ok, you know your stuff and it's a thorough set of tests. Reproduced here ok. Realised the `for()` loop was the wrong way around, should be by column (not row). Or, back to the normalized (flattened) route as first suggested. Looking back, the error `combining bywithoutby with := in j is not yet implemented` can be solved by setting `mult="first"`, assuming that the keys are unique. Also assuming that the tests are simplified and we can't speed up by knowing input1 and input2 have the same number of rows with the same keys like the tests. Anyway, will modify the loop to go by column ... – Matt Dowle Apr 19 '12 at 00:12
  • @MattDowle the answer doesn't even work anymore; maybe delete? Otherwise, could you update to avoid `with=FALSE`? I don't think we can for `-1` yet, but the latter case should be OK. – MichaelChirico Sep 27 '19 at 03:18

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.

library(data.table)
library(reshape2)

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")

dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table

dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")

dt1s[dt2s,value:=value.new] # Update new values

dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
dnlbrky