I am working on a financial problem that involves deleting messages from a financial center. I am using data.table and am very satisfied with its performance and ease of use.
Still, I keep asking myself how I can improve my code and use the full power of data.table.
Here is an example of my task:
library(data.table)
set.seed(1)
DT <- data.table(
  SYM  = c(rep("A", 10), rep("B", 12)),
  PRC  = format(rlnorm(22, 2), digits = 2),
  VOL  = rpois(22, 312),
  ID   = c(seq(1000, 1009), seq(1004, 1015)),
  FLAG = c(rep("", 8), "R", "A", rep("", 4), "R", rep("", 7))
)
DT$PRC[9] <- DT$PRC[6]
DT$PRC[7] <- DT$PRC[6]
DT$VOL[9] <- DT$VOL[6]
DT$VOL[7] <- DT$VOL[6]
DT$PRC[15] <- DT$PRC[13]
DT$VOL[15] <- DT$VOL[13]
## See the original dataset
DT
## Set the key
setkey(DT, "SYM", "PRC", "VOL", "FLAG")
## Get all rows that match a row with FLAG == "R" on the variables given in the list
DT[DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## Remove these rows from the dataset
DT <- DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## See the modified data.table
DT
My questions are:
- Is this an efficient way to perform my task, or is there a more 'data.table'-style approach? Is the key set efficiently?
- How can I perform my task if I have many more than three variables to match on (here: SYM, PRC, VOL)? Is there a way to specify the matching columns by exclusion? I know I can do this data.frame style, but I want to know whether there is a more elegant way for a data.table.
- What about the copying in the last command? Following the thread on removing rows by reference, I think copying is the only way to do it. If I have several such tasks, can I combine them so that I avoid a copy for each task?
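To make the second question concrete, here is a minimal sketch of what "matching by exclusion" could look like. It is only an illustration under some assumptions: the table `DT2` and the names `match_cols`, `to_remove` and `kept` are hypothetical, and the `on =` argument requires a data.table version that supports it (1.9.6 or later, as far as I know). The matching columns are derived with `setdiff()` from the columns that should not take part in the match, and the rows are dropped with a not-join.

```r
library(data.table)

# Small hypothetical table standing in for the real data
DT2 <- data.table(
  SYM  = c("A", "A", "A", "B", "B"),
  PRC  = c("10", "10", "12", "7", "7"),
  VOL  = c(300L, 300L, 310L, 290L, 295L),
  ID   = 1:5,
  FLAG = c("", "R", "", "R", "")
)

# Specify the matching columns by exclusion: everything except ID and FLAG
match_cols <- setdiff(names(DT2), c("ID", "FLAG"))

# Rows flagged "R", reduced to the matching columns and de-duplicated
to_remove <- unique(DT2[FLAG == "R", match_cols, with = FALSE])

# Not-join: keep only rows of DT2 that have no match in to_remove.
# This still creates a new table; it does not delete rows by reference.
kept <- DT2[!to_remove, on = match_cols]
```

With this approach no key has to be set on `DT2`, and the list of matching columns grows automatically when new columns are added, as long as the excluded ones are listed.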