I have a data.table X in which I would like to create a variable based on two character variables:

   X[, varC := ((VarA == "A" & !is.na(VarA))
                | (VarA == "AB" & VarB == "B" & !is.na(VarA) & !is.na(VarB)))]

This code works, but it is very slow because it does a vector scan on two character variables. Note that I have not set a key on X by VarA and VarB. Is there a "right" way to do this in data.table?

Update 1: I don't use setkey for this transformation because I already use setkey(X, Year, ID) for other variable transformations. If I did, I would need to reset the key back to Year, ID after this transformation.

Update 2: I benchmarked my approach against Matthew's approach, and his is much faster:

          test replications elapsed relative user.self sys.self user.child sys.child
2      Matthew          100   3.377    1.000     2.596    0.605          0         0
1 vectorSearch          100 200.437   59.354    76.628   40.260          0         0

The only minor drawback is that calling setkey and then resetting the key afterwards is somewhat verbose :)
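The set-then-restore pattern described in these updates could look roughly like the sketch below. The toy table and its values are made up for illustration, and `old_key` is a hypothetical helper name; only `Year`, `ID`, `VarA`, `VarB`, and `varC` come from the question.

```r
library(data.table)

# Toy data (hypothetical values, not from the question)
X <- data.table(
  Year = c(2010, 2010, 2011, 2011, 2012),
  ID   = 1:5,
  VarA = c("A", "AB", "AB", NA, "C"),
  VarB = c("x", "B", "z", "B", NA)
)
setkey(X, Year, ID)

old_key <- key(X)                # remember the existing key
setkey(X, VarA, VarB)            # temporary key for the binary-search lookups
X[, varC := FALSE]
X[J("A"), varC := TRUE]          # rows where VarA == "A"
X[J("AB", "B"), varC := TRUE]    # rows where VarA == "AB" and VarB == "B"
setkeyv(X, old_key)              # restore the original key (Year, ID)
```

Rows with a missing VarA or VarB never match the join values, so they keep `varC == FALSE` without any explicit `!is.na()` checks.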

mnel
AdamNYC
  • There might also be some unnecessary coding there; e.g if VarA == "AB" is TRUE, then it will also always be TRUE that !is.na(VarA), right? – Marc in the box Dec 01 '12 at 13:45
  • Hi Marc, the !is.na is required. Otherwise, if VarA is missing, then condition VarA=="AB" will return NA instead of 0 as I would like – AdamNYC Dec 01 '12 at 15:14
  • Hi Wojciech, I don't use setkey because I already use setkey in previous variable transformations. This is just one of the many variable creation steps I have to do, so I would like to avoid setkey if necessary (otherwise, I need to reset keys after completing this transformation). – AdamNYC Dec 01 '12 at 15:16
  • @AdamNYC setting a key is fast and is required to speed up calculations when you use data.table – Wojciech Sobala Dec 02 '12 at 13:08

1 Answer

How about :

setkey(X,VarA,VarB)
X[,varC:=FALSE]
X["A",varC:=TRUE]
X[J("AB","B"),varC:=TRUE]

or, in one line (to save repetitions of the variable X and to demonstrate) :

X[,varC:=FALSE]["A",varC:=TRUE][J("AB","B"),varC:=TRUE]

To avoid setting the key, as requested, how about a manual secondary key :

S = setkey(X[,list(VarA,VarB,i=seq_len(.N))],VarA,VarB)
X[,varC:=FALSE]
X[S["A",i][[2]],varC:=TRUE]
X[S[J("AB","B"),i][[3]],varC:=TRUE]

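The `[[2]]` and `[[3]]` extractions above rely on the join returning the key columns alongside `i`. As far as I know, later data.table releases return a plain vector from that form, so the positions may not mean the same thing there. A sketch that should not depend on that behaviour pulls the row numbers out by name instead; the toy data and the names `rows_a` and `rows_ab_b` are hypothetical.

```r
library(data.table)

# Toy data (hypothetical values, not from the question)
X <- data.table(
  Year = c(2010, 2010, 2011, 2011, 2012),
  ID   = 1:5,
  VarA = c("A", "AB", "AB", NA, "C"),
  VarB = c("x", "B", "z", "B", NA)
)
setkey(X, Year, ID)

# Manual secondary key: a keyed copy of the lookup columns plus row numbers
S <- setkey(X[, list(VarA, VarB, i = seq_len(.N))], VarA, VarB)

# Look up matching row numbers by column name; nomatch = 0L drops non-matches
rows_a    <- S[J("A"), nomatch = 0L]$i
rows_ab_b <- S[J("AB", "B"), nomatch = 0L]$i

X[, varC := FALSE]
X[rows_a, varC := TRUE]
X[rows_ab_b, varC := TRUE]
```

The key on X (Year, ID) is left untouched, at the cost of holding the small extra table S.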
Now clearly, that syntax is ugly. So the aim of FR#1007 ("Build in secondary keys") is to build that into the syntax; e.g.,

set2key(X,VarA,VarB)
X[...some way to specify which key to join to..., varC:=TRUE]

In the meantime it's possible, just manually, as shown above.
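For readers on a later data.table release: to my knowledge, versions from 1.9.6 onward accept an `on=` argument that joins on named columns without setting a key. A minimal sketch under that assumption, with made-up toy data:

```r
library(data.table)

# Toy data (hypothetical values, not from the question)
X <- data.table(
  Year = c(2010, 2010, 2011, 2011, 2012),
  ID   = 1:5,
  VarA = c("A", "AB", "AB", NA, "C"),
  VarB = c("x", "B", "z", "B", NA)
)
setkey(X, Year, ID)   # existing key stays in place

X[, varC := FALSE]
# Ad hoc joins on the named columns; requires a data.table with on= support
X[.(VarA = "A"), on = "VarA", varC := TRUE]
X[.(VarA = "AB", VarB = "B"), on = c("VarA", "VarB"), varC := TRUE]
```

Because the update only touches `varC`, the key on Year and ID should remain as it was.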

Matt Dowle
  • Hi Matthew, please see my update for the reason I avoided setkey here. But maybe setkey followed by resetting the key is still faster than a vector scan :) – AdamNYC Dec 01 '12 at 15:21
  • Thanks a lot, Matt. I learned many new things today. set2key would be lovely. For now, it seems to me that setting and resetting keys, although it adds two more lines to the code, is easier to read. It also does not create another (small) dataset (i.e., S in your example). – AdamNYC Dec 01 '12 at 16:55