Match rows on several variables efficiently - exclusion of columns

Question

I am working on a financial problem of deleting messages from a financial center. I am using data.table and I am very satisfied with its performance and easy handling.

Though, I ask myself always how to improve and use the whole power of data.table.

Here is an example of my task:

set.seed(1)
DT <- data.table(SYM = c(rep("A", 10), rep("B", 12)), PRC = format(rlnorm(22, 2), digits = 2), VOL = rpois(22, 312), ID = c(seq(1000, 1009), seq(1004, 1015)), FLAG = c(rep("", 8), "R", "A", rep("", 4), "R", rep("", 7)))
DT$PRC[9] <- DT$PRC[6]
DT$PRC[7] <- DT$PRC[6] 
DT$VOL[9] <- DT$VOL[6]
DT$VOL[7] <- DT$VOL[6]
DT$PRC[15] <- DT$PRC[13]
DT$VOL[15] <- DT$VOL[13]
## See the original dataset
DT
## Set the key
setkey(DT, "SYM", "PRC", "VOL", "FLAG")
## Get all rows, that match a row with FLAG == "R" on the given variables in the list
DT[DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## Remove these rows from the dataset
DT <- DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## See the modified data.table
DT

My questions are now:

Is this an efficient way to perform my task or does there exist something more 'data.table' style? Is the key set efficiently?
How can I perform my task if I do not only have three variables to match on (here: SYM, PRC, VOL) but a lot more, does there exist something like exclusion (I do know I can use it data.frame style but I want to know if there is a more elegant way for a data.table)?
What is with the copying in the last command? Following the thread on remove row by reference, I think copying is the only way to do it. What if I have several tasks, can I compound them in a way and avoid copying for each task?

+1 Is that data random? I see an `rlnorm` (though I'm not sure what that is). Could you use `set.seed(1)` (or some other seed) at the top so we're all on the same page? You're doing pretty much what I would do, though I'd keep two versions of the table DT1 <- DT[!...], since I'm not used to hitting memory constraints. — Frank, Oct 17 '13 at 15:52
@Frank Thanks for this comment! You are true! And my data is not random at all :) — Simon Z., Oct 17 '13 at 19:28

Frank · Answer 1 · 2013-10-17T18:25:40.727

If you are only setting the key to perform this operation, @eddi's answer is the best and easiest to read.

setkey(DT, SYM, PRC, VOL)
# ^ as in @eddi's answer, since you are not using the rest of the key
microbenchmark(
    notjoin=DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]],
    logi_not=DT[!DT[,rep(any(FLAG=='R'),.N),by='SYM,PRC,VOL']$V1],
    idx_not=DT[!DT[,if(any(FLAG=='R')){.I}else{NULL},by='SYM,PRC,VOL']$V1],
    SD=DT[,if(!any(FLAG=='R')){.SD}else{NULL},by='SYM,PRC,VOL'],
    eddi=DT[!DT[FLAG == "R"]],
    times=1000L
)

results:

Unit: milliseconds
     expr      min       lq   median       uq       max neval
  notjoin 4.983404 5.577309 5.715527 5.903417 66.468771  1000
 logi_not 4.393278 4.960187 5.097595 5.273607 66.429358  1000
  idx_not 4.523397 5.139439 5.287645 5.453129 15.068991  1000
       SD 3.670874 4.180012 4.308781 4.463737  9.429053  1000
     eddi 2.767599 3.047273 3.137979 3.255680 11.970966  1000

On the other hand, several of options above do not require that your operation involve grouping by the key. Suppose you either...

are doing this once using groups other than the key (which you don't want to change) or
want to perform several operations like this using different groupings before doing the copy operation to drop rows, newDT <- DT[...] (as mentioned in the OP's point 3).

.

setkey(DT,NULL)
shuffDT <- DT[sample(1:nrow(DT))] # not realistic, of course
# same benchmark with shuffDT, but without methods that require a key
# Unit: milliseconds
#      expr      min       lq   median       uq      max neval
#  logi_not 4.466166 5.120273 5.298174 5.562732 64.30966  1000
#   idx_not 4.623821 5.319501 5.517378 5.799484 15.57165  1000
#        SD 4.053672 4.448080 4.612213 4.849505 66.76140  1000

In these cases, the OP's and eddi's methods are not available (since joining requires a key). For a one-off operation, using .SD seems faster. For subsetting by multiple criteria, you'll want to keep track of the rows you want to keep/drop before making the copy newDT <- DT[!union(badrows1,badrows2,...)].

DT[,rn:=1:.N] # same as .I
badflagrows <- DT[,if(any(FLAG=='R')){rn}else{NULL},by='SYM,PRC,VOL']$V1
# fill in next_cond, next_grp
badnextrows <- DT[!badflagrows][,
    if(any(next_cond)){rn}else{NULL},by='next_grp']$V1

Perhaps something similar can be done with the logical subsetting ("logi_not" in the benchmarks), which is a little faster.

score 1 · Answer 2 · answered Oct 17 '13 at 16:24

1

I'm confused why you're setting the key to FLAG, isn't what you want simply

setkey(DT, SYM, PRC, VOL)

DT[!DT[FLAG == "R"]]

answered Oct 17 '13 at 16:24

eddi

49,088
6
104
155

@eddi Indeed, it is. Good to know, I thought, that the table must also have "FLAG" in its key to be able to search for "R" in this variable. Thank you very much for your help! – Simon Z. Oct 17 '13 at 19:30

Match rows on several variables efficiently - exclusion of columns

2 Answers2