I need to create a new variable in the dataset below:

A  X
a  1
b  2
c  3
d  4
e  5
f  6
g  7
h  8
i  9
j 10

newvar should be 1 if X equals 2, 5, 7, or 9. Otherwise, newvar should be 0.

Code:

library(data.table)
dt1 <- data.table(A = letters[1:10], X = 1:10, key = "X")
numberlist <- list(2,5,7,9)

I have tried the following based on a post here:

dt1[, newvar:=.SD, .SDcols = 0][%in% numberlist, newvar:=.SD, .SDcols = 1]
dt1[, newvar:=.SD, .SDcols = 0][X %in% numberlist, newvar:=.SD, .SDcols = 1]

dt1[, newvar:=.SD, .SDcols = 0] is meant to "assign a value of 0 to newvar as the default". The second bracket, [%in% numberlist, newvar:=.SD, .SDcols = 1], is meant to "set newvar to 1 if the key (X) is included in numberlist".

Any idea why it is not working?

user3507584

1 Answer

Try

dt1[, newvar:=(X %in% c(2,5,7,9))+0L][]
#     A  X newvar
# 1: a  1      0
# 2: b  2      1
# 3: c  3      0
# 4: d  4      0
# 5: e  5      1
# 6: f  6      0
# 7: g  7      1
# 8: h  8      0
# 9: i  9      1
#10: j 10      0

Or, if we already have the matching elements stored in a vector:

numberlist <- c(2,5,7,9)
dt1[, newvar:=as.numeric(X %in% numberlist)] 

as.numeric is another option to coerce the logical vector to 0/1 values.
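
The two coercions give the same 0/1 values but different column types, which may matter for memory on a large table. A minimal sketch of the difference, rebuilding `dt1` from the question; the column names `newvar_int` and `newvar_dbl` are made up here for illustration:

```r
library(data.table)

dt1 <- data.table(A = letters[1:10], X = 1:10, key = "X")
numberlist <- c(2, 5, 7, 9)

# Adding 0L to a logical vector yields an integer column
dt1[, newvar_int := (X %in% numberlist) + 0L]

# as.numeric() yields a double column holding the same 0/1 values
dt1[, newvar_dbl := as.numeric(X %in% numberlist)]

class(dt1$newvar_int)  # "integer"
class(dt1$newvar_dbl)  # "numeric"
```

`as.integer(X %in% numberlist)` should be equivalent to the `+0L` form if the explicit function call reads more clearly.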

akrun
  • Thanks @akrun. It works, but I still do not understand the logic behind it. Why use two sets of `[]`? What was wrong with my code? – user3507584 Apr 13 '15 at 16:53
  • @user3507584 The last `[]` is just there to print the output on the console. The logic uses `%in%` to build a logical condition that is TRUE where X matches 2, 5, 7, or 9, and `+0L` coerces that logical vector to integer 0/1. You can also use `as.numeric` instead of `+0L`. – akrun Apr 13 '15 at 16:55
  • Maybe better (for readability and for beginners) to use `transform(dt1, newvar=(dt1$X %in% c(2,5,7,9))+0L)`? – Colonel Beauvel Apr 13 '15 at 16:56
  • @ColonelBeauvel That would be data.frame code, and even there you don't need `dt1$X`; just `X %in%` would be sufficient. The OP created a data.table. – akrun Apr 13 '15 at 16:57
  • @user3507584 Just saw your `numberlist`. It is better to create a vector, i.e. `numberlist <- c(2,5,7,9)` – akrun Apr 13 '15 at 16:59
  • @user3507584 Another thing I noticed in your code is that it uses `.SDcols=0` and then assigns `.SD` to `newvar`. I am not sure what you were trying to do there. Also, running `dt1[,.SD,.SDcols = 0]` gave the error message: Error in `[.data.table`(dt1, , .SD, .SDcols = 0) : .SDcols is numeric but out of bounds (or NA) – akrun Apr 13 '15 at 17:11
  • @akrun I was trying to replicate the example that I reference in the post. I opted for that because in the same post they compared the performance of three different alternatives and this was the fastest one (and I have 50M observations) – user3507584 Apr 13 '15 at 17:15
  • @user3507584 `.SDcols` is for selecting columns. For example, if we need to multiply the 2nd column by 2 and store the result as a new variable: `dt1[, newvar:=.SD*2, .SDcols=2]` – akrun Apr 13 '15 at 17:17
  • @user3507584 If you have 50M obs, but a smaller number of unique values in `X`, you could try akrun's code with a `by=X` for speed (so it will test the `%in%` once per value of `X` instead of for each row). – Frank Apr 13 '15 at 18:59
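
Two ideas from the comments can be sketched in code. The first is a working form of what the two-bracket attempt in the question appears to have been aiming for: assign a default by reference, then overwrite only the matching rows. The second is the `by=X` variant suggested above. This is a sketch under assumptions, not a benchmark: `dt2` and `newvar2` are names made up for illustration, and whether `by=X` is actually faster would have to be measured on the real 50M-row data.

```r
library(data.table)

dt1 <- data.table(A = letters[1:10], X = 1:10, key = "X")
numberlist <- c(2L, 5L, 7L, 9L)  # a plain vector rather than a list

# First bracket: default of 0 for every row.
# Second bracket: overwrite with 1 only where X is in numberlist.
dt1[, newvar := 0L][X %in% numberlist, newvar := 1L]

# Variant with by = X: the %in% test is evaluated once per distinct X
dt2 <- data.table(A = letters[1:10], X = 1:10, key = "X")
dt2[, newvar2 := (X %in% numberlist) + 0L, by = X]

# A trailing [] prints the updated table after := at the console
dt1[]
```

In this form no `.SD` or `.SDcols` is needed, because the values being assigned are constants rather than existing columns.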