1

I have a dataset with around 25 million rows. I am taking a subset of these rows and performing a function which works fine. However, what I then need to do is update the values in original dataset with new values while retaining the rest. I am sure this is straightforward but I just can't get my head around it.

This is a simplified version of what I am dealing with:

require("data.table")

df <-data.frame(AREA_CD = c(sample(1:25000000, 25000000, replace=FALSE)), ALLOCATED = 0, ASSIGNED = "A", ID_CD = c(1:25000000))
df$ID_CD <- interaction( "ID", df$ID_CD, sep = "")
dt <- as.data.table(df)

sub_dt <- dt[5:2004,]
sub_dt[,ALLOCATED:=ALLOCATED+1]
sub_dt[,ASSIGNED:="B"]

What I am after is the values in 'ALLOCATED' and 'ASSIGNED' from sub_dt to replace the 'ALLOCATED' and 'ASSIGNED' values in dt based on the 'ID_CD' column. The output I would be after, based on my example, would still have 25 million rows but have 2,000 updated rows. Any help would be much appreciated. Thanks.

Chris
  • 1,197
  • 9
  • 28
  • Please make sure to keep your [minimal, Complete and Verifiable example](http://stackoverflow.com/help/mcve)... minimal – Sotos Feb 26 '17 at 13:21
  • 6
    I think something like ```dt[sub_dt, `:=` (ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = .(ID_CD)]``` should do. Other than that, I really hope the way you've created the data set wasn't the actual syntax you've used in your real code. – David Arenburg Feb 26 '17 at 13:47
  • @DavidArenburg Thanks, that works great. And no, the syntax I used in my example is not what I am actually using. – Chris Feb 26 '17 at 15:25

1 Answers1

3

The answer provided by David Arenburg in his comment explains how to join the subset of modified data back into the original data.table.

However, I wonder why the OP doesn't apply the changes directly in the original data.table by reference using a function which returns a list:

my_fun <- function(alloc, assig) {
  list(
    alloc + 1,
    "B")
}

With this function the subset of rows can be updated directly within the data.table:

dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
dt[1:7]
#   AREA_CD ALLOCATED ASSIGNED ID_CD
#1:    1944         0        A   ID1
#2:    3265         0        A   ID2
#3:   15415         0        A   ID3
#4:   14121         0        A   ID4
#5:   10546         1        B   ID5
#6:    2263         1        B   ID6
#7:   12339         1        B   ID7

Benchmark

Due to memory limitations only a smaller data set with 2.5 million rows (instead of 25 million in the OP) is used.

library(microbenchmark)
setDT(df)  # coerce df to data.table
microbenchmark(
  copy = dt <- copy(df),
  join = {
    dt <- copy(df)
    sub_dt <- dt[5:2004,]
    sub_dt[,ALLOCATED:=ALLOCATED+1]
    sub_dt[,ASSIGNED:="B"]
    dt[sub_dt, `:=`(ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = .(ID_CD)]
  },
  byref = {
    dt <- copy(df)
    dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
  },
  times = 10L
)
#Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
#  copy  13.80400  14.07850  28.22882  14.15836  14.39643 154.70570    10
#  join 239.36476 240.72745 244.27668 243.52967 246.17104 255.06271    10
# byref  14.28806  14.47308  15.00056  14.63147  14.73134  18.71181    10

Updating the data.table "in place" is much faster than creating a subset and later join. The copy operation is required to start every benchmark run with an unmodified version of dt. Therefore, the copy operation is benchmarked as well.

data.tableversion 1.10.4 was used.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • thanks for your answer. I hadn't thought of doing something like this. Your suggested route works. However, the issue I have with your solution is that I have set keys in my real version of dt and these are lost once your function is applied. – Chris Feb 27 '17 at 12:50
  • Glad, it works. Which fields are keyed? Can you show the result of `data.table::tables()` before and after you apply the function? Perhaps, it's better if you could post a new question with an appropriate [mcve] including the keys? – Uwe Feb 27 '17 at 14:10
  • I have asked a new question (http://stackoverflow.com/questions/42491624/matching-unique-and-non-unique-values-between-data-tables-and-update-data-table) relating to this one. Your answer made me think that I have been approaching the larger problem I am trying to solve wrong. I therefore need some help pointing me in the right direction! – Chris Feb 27 '17 at 17:10