2

I have to deal with huge data sets, that is why I have to use data.table package. I would like to pick only rows which have appropriate values in given column.

dt <- data.table(a = rep(c("A", "B", "C"), 3), 
                 b = 1:9)
n <- c("A", "C")

What I can do:

dt[ a %in% n]
   a b
1: A 1
2: C 3
3: A 4
4: C 6
5: A 7
6: C 9

What if I do not know name of column before and get it from function as string? I tried:

dt[ "a" %in% n]
Empty data.table (0 rows) of 2 cols: a,b

dt[ "a" %in% n, with  = F]
Error in `[.data.table`(dt, "a" %in% n, with = F) : 
  j must be provided when with=FALSE

dt[ as.name("a") %in% n ]
Error in match(x, table, nomatch = 0L) : 
  'match' requires vector arguments

Question - is it possible to use string in such a task?

  • 3
    Try `dt[n, on = "a"]` or `dt[eval(as.name("a")) %in% n]` or `setkeyv(dt, "a")[n]` – David Arenburg Mar 14 '17 at 12:32
  • @David Might as well make that an answer. The `eval` one should be on the entirety of `i` not just the name, I guess, for autoindexing to be used. – Frank Mar 14 '17 at 15:48
  • @Frank don't we have some dupe for this somewhere? Maybe [this](http://stackoverflow.com/questions/27677283/evaluating-both-column-name-and-the-target-value-within-j-expression-within-d) ? – David Arenburg Mar 14 '17 at 15:59
  • @DavidArenburg Thank you very much! Now I can cast this spell! – Pesto Verde Mar 16 '17 at 08:10

2 Answers2

1

Could achieve this in various ways

Either using joins (which will return an ordered result, without sorting dt in place)

dt[.(n), on = "a"] 

Or (which will also return an ordered result while sorting dt in place)

setkeyv(dt, "a")[.(n)]

Or using eval/as.name (which will return an unsorted result)

dt[eval(as.name("a")) %in% n]

Or using a list subset (which will return an unsorted result)

dt[dt[["a"]] %in% n]

A few points you might want to take consideration of.

in v 1.9.4 data.table package introduced secondary indexes. This means that when you do a vector scan (== or %in%) it will create a some sort or a key (similarly as it would if you will use setkey explicitly) which will make the first run a bit slower but will significantly increase performance the next time you will search for matches in this column.

Though, not in all cases the secondary key will be (permanently) set. In some cases it will be always a simple vector scan (e.g. dt[eval(as.name("a")) %in% n, verbose = TRUE] or dt[get('a') %in% n] or dt[dt[["a"]] %in% n, verbose = TRUE]) which will be faster if you only run this once. In some case an ad-hoc index will be created within each run (e.g. dt[n, on = "a", verbose = TRUE], but will also use an index if it already exists) which is sub-optimal, while in some cases a permanent key or a secondary key will be set (e.g. setkeyv(dt, "a")[.(n), verbose = TRUE] or as proposed by @Frank dt[eval(substitute(col %in% n, list(col=as.name("a")))), verbose = TRUE])

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • What is about explicitely setting a secondary index with `setindex()` instead of `setkey()`? Compare `DT <- data.table(a = 1:2); DT[.(1L), on=.(a), verbose = TRUE]; DT[.(1L), on=.(a), verbose = TRUE]` with `DT[.(1L), on=.(a), verbose = TRUE]; setindex(DT, a); DT[.(1L), on=.(a), verbose = TRUE]`. – Uwe Mar 15 '17 at 12:11
0

Use get function

dt[get('a') %in% n]

It also works if colname is stored in another variable:

temp <- 'a'
dt[get(temp) %in% n]
Feng
  • 603
  • 3
  • 9