I have a data.table dt that I'd like to filter, keeping only the rows where any value in the row contains the string "tokeep".

library(data.table)     
cola <- c(1:5)
colb <- c(letters[1:4], "tokeep")
dt <- data.table(cola, colb)
dt
   cola   colb
1:    1      a
2:    2      b
3:    3      c
4:    4      d
5:    5 tokeep

Expected result:

 dt[grepl("tokeep", colb)]
   cola   colb
1:    5 tokeep

However, I don't know in which column "tokeep" will be found. I have tried using .SD in i like this:

 dt[any(grepl("tokeep", .SD))]
Empty data.table (0 rows) of 2 cols: cola,colb

Also, I can't figure out the following:

> dt[,print(any(grepl("tokeep", .SD)))]
[1] TRUE
[1] TRUE

Shouldn't it be FALSE, TRUE since "tokeep" only exists in colb?
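
A likely reading of that output, sketched here as an assumption rather than a confirmed diagnosis: without `by`, `.SD` in `j` is the whole table, so `grepl` returns one value per column and `any` collapses them into a single `TRUE`. The two printed lines would then be the same value shown twice, once by `print` and once because the console auto-prints the value that `print` returns. The names `per_column` and `collapsed` are illustrative only.

```r
library(data.table)

dt <- data.table(cola = 1:5, colb = c(letters[1:4], "tokeep"))

# With no `by`, .SD is the full table, so grepl() sees a list of two columns
# and returns one logical per column, not one per row.
per_column <- dt[, grepl("tokeep", .SD)]
per_column

# any() collapses the per-column result to a single value. Wrapping it in
# print() shows it once, and the returned value is auto-printed a second time.
collapsed <- dt[, any(grepl("tokeep", .SD))]
collapsed
```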

gaut
  • `.SD` is only defined in `j`. In `i`, `.SD` is `NULL`. You can use `dt[rowSums(dt=="tokeep") > 0]`. There should be a dupe somewhere – chinsoon12 Jul 18 '19 at 09:05
  • How can I use `grepl` instead of strict equality? The string must contain "tokeep", not be exactly equal to it – gaut Jul 18 '19 at 09:09
  • Maybe `dt[dt[, .I[Reduce(`|`, lapply(.SD, grepl, pattern="tokeep"))]]]`. Not sure if this can be more succinct, but you can search around for a base R solution and recode it in data.table syntax, like `dt[dt[apply(dt, 1L, function(x) any(grepl("tokeep", x))), which=TRUE]]` – chinsoon12 Jul 18 '19 at 09:12
  • I guess the question is equivalent to "how to apply a function to each row of dt"... – gaut Jul 18 '19 at 09:13
  • Just updated my comment... `apply(DF, 1L, function(row) ...)`, and I suspect it will be faster with base R if you keep it as a character matrix – chinsoon12 Jul 18 '19 at 09:14
  • Indeed, but then I lose the nice data.table speed. Isn't there a built-in way? – gaut Jul 18 '19 at 09:15
  • If you need to loop over rows, you usually should rearrange your data. Looping over rows of a `data.frame` (or `data.table`) is slow, and there is nothing the data.table package can do about that. Looping over rows of a matrix should be faster, but still not fast. The problem then is the repeated calls to a closure (in this case `grepl`), and for `grepl` specifically you are also repeatedly interfacing with the regex engine. – Roland Jul 18 '19 at 09:27
  • How should I rearrange the data? Are there any specific rules, or would defining a key, e.g. with `keyby`, help? – gaut Jul 18 '19 at 09:43
  • Reshape to long format. Then you can do one `grepl` call for all data. – Roland Jul 18 '19 at 10:11
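
The long-format suggestion in the last comment could look roughly like the sketch below. It is one possible reading of that advice, not code posted in the thread: `row_id`, `chr`, `long` and `hits` are hypothetical names, and every column is converted to character first so that `melt` does not have to reconcile mixed types.

```r
library(data.table)

dt <- data.table(cola = 1:5, colb = c(letters[1:4], "tokeep"))

# Convert every column to character and tag each row with its position.
# `row_id` is a hypothetical helper column, not part of the original data.
chr <- dt[, lapply(.SD, as.character)]
chr[, row_id := .I]

# Long format: one row per (row_id, column) pair.
long <- melt(chr, id.vars = "row_id",
             variable.name = "column", value.name = "value")

# A single vectorised grepl() call over all cells, then map back to row numbers.
hits <- sort(unique(long[grepl("tokeep", value), row_id]))

dt[hits]
```

The reshape costs memory proportional to the number of cells, so whether this beats a per-column approach probably depends on the shape of the table.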

1 Answer

After reading this post, I think this might be a concise, more data.table-like way to apply a function to every row of a data.table. I'm interested in any other suggestions.

> dt[dt[, any(grepl("tokeep", .SD)), by = seq_len(nrow(dt))]$V1]
   cola   colb
1:    5 tokeep
gaut
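
As one other suggestion, grouping by row calls `grepl` once per row, whereas the `Reduce` idea from the comments calls it once per column and lets each call run vectorised over all rows. A minimal sketch of that column-wise variant follows, simplified from the comment by using the logical vector directly in `i`; `keep` is an illustrative name.

```r
library(data.table)

dt <- data.table(cola = 1:5, colb = c(letters[1:4], "tokeep"))

# grepl() runs once per column (each call is vectorised over the rows),
# then the per-column logical vectors are OR-ed together element-wise.
keep <- dt[, Reduce(`|`, lapply(.SD, grepl, pattern = "tokeep"))]

dt[keep]
```

If the search term is a literal string rather than a regular expression, passing `fixed = TRUE` to `grepl` is an option worth trying.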