As I mentioned in the comments, str_detect(text, my_string) is the bottleneck in your code. Also note that it does not do exactly what you expect: it performs a regex search, so every row whose text merely contains an "a" somewhere (one of the patterns in my_string) is counted as well. See the examples below.
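A quick illustration: because the single letter "a" is one of the alternatives in the pattern, str_detect() matches it inside "palabras":
str_detect("text palabras", "a")
[1] TRUE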
library(data.table)
library(dtplyr)
library(stringr)
library(dplyr)
my_dt <- data.table(id = 1:300000,
text = rep(c("this is some text", "this is some more text",
"text palabras"), 100000)) #imagine many more strings
my_string <- paste(stringr::words, collapse = "|") # one big regex alternation of common English words
# start counting time (system.time() works too, but it doesn't print the results of the code it times)
timing <- Sys.time()
# run the code
lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()
id text
1: 1 this is some text
2: 2 this is some more text
3: 3 text palabras
4: 4 this is some text
5: 5 this is some more text
---
299996: 299996 this is some more text
299997: 299997 text palabras
299998: 299998 this is some text
299999: 299999 this is some more text
300000: 300000 text palabras
Sys.time() - timing
Time difference of 6.708245 secs
Note: the data.table equivalent of your code above is the following:
my_dt[str_detect(text, my_string), ]
Timing this takes about 6.52 seconds, so not much of an improvement.
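As an aside, the false substring matches could be avoided by wrapping the alternation in word boundaries; a minimal sketch (this fixes correctness, but I wouldn't expect it to fix the speed):
my_string_wb <- paste0("\\b(", paste(stringr::words, collapse = "|"), ")\\b")
str_detect("palabras", my_string_wb) # FALSE now: "a" must match as a whole word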
As you can see from the result above, this selection returns all the rows, because there is an "a" in "palabras"; that row shouldn't be in the result. Now, data.table has a function called %chin%, which works like %in% but is a lot faster for character vectors (a small example follows below). To match on whole words we just need to tokenize everything, which can be done with unnest_tokens from tidytext; this function respects the data.table format. Afterwards I filter the data on the matching words, drop the word column, and take the distinct (unique) rows of the data.table, because the result can contain duplicate lines when multiple words match. Even though there are more function calls, this is about three times as fast.
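A small example of %chin%, which returns a logical vector just like %in%:
c("some", "palabras") %chin% c("this", "is", "some", "text")
[1]  TRUE FALSE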
library(tidytext)
timing <- Sys.time()
# one row per token; drop = FALSE keeps the original text column
my_dt <- unnest_tokens(my_dt, word, text, drop = FALSE)
# keep rows whose token is in stringr::words, deduplicate per sentence, drop the word column
my_dt <- unique(my_dt[word %chin% words, ], by = c("id", "text"))[, c("id", "text")]
id text
1: 1 this is some text
2: 2 this is some more text
3: 4 this is some text
4: 5 this is some more text
5: 7 this is some text
---
199996: 299993 this is some more text
199997: 299995 this is some text
199998: 299996 this is some more text
199999: 299998 this is some text
200000: 299999 this is some more text
Sys.time() - timing
Time difference of 2.380911 secs
Now, to speed things up a bit more, you can set the number of threads data.table uses. By default this is 2 on my system; you can check it with getDTthreads(). When I add one thread with setDTthreads(3), the code above returns in about 1.6 seconds.
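For example:
getDTthreads()  # 2 by default on my system
setDTthreads(3) # use one extra thread, then re-run the timed code above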
Now maybe someone can speed this up a bit more by doing the tokenization inside data.table itself (in j, working per group); a rough sketch of that idea follows.
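For instance, starting from the original my_dt, something along these lines; a minimal sketch assuming lower-casing and splitting on non-letters is good enough as a tokenizer (unnest_tokens handles more edge cases, and I haven't timed this variant):
# tokenize in j: one row per (id, text, word)
tokens <- my_dt[, .(word = unlist(strsplit(tolower(text), "[^a-z']+"))), by = .(id, text)]
# filter on matching words, deduplicate per sentence, drop the word column
result <- unique(tokens[word %chin% words], by = c("id", "text"))[, .(id, text)]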