
When I do text analysis, I frequently want to figure out whether each of a large number of documents contains any element of a list of strings. If I have millions of documents (e.g. tweets) and a long list of patterns, this can take a long time.

I usually use the following packages to optimize for speed: data.table, dtplyr, and stringr.

What are some best practices to optimize string detection and analysis thereof? Are there packages that would allow me to optimize code like this:

library(data.table)
library(dtplyr)
library(stringr)

my_dt <- data.table(text = c("this is some text", "this is some more text")) # imagine many more strings
my_string <- paste(words, collapse = "|") # words is the built-in word vector from stringr

lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()

I would assume that using data.table directly instead of the dtplyr implementation would increase speed. Are there any other ways to improve performance for this kind of application?
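To make that concrete, the direct data.table version I have in mind would just be the row filter below (a sketch, not benchmarked):

my_dt[str_detect(text, my_string)]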


I looked at this question and was hoping I could get some similar guidance. Hopefully, the question is specific enough as it is now.

Tea Tree
  • `str_detect(text, my_string)` is the bottleneck in your code. Using pure data.table / stringi would only slightly improve the speed. Once the question is reopened I will post an answer that is faster when using data.table. On 30000 records I gain about an 8-fold increase in speed compared to your original code. – phiver Jul 28 '20 at 13:25
  • Awesome, I really appreciate it. Is there a way to speed up the re-opening of my question? – Tea Tree Jul 28 '20 at 18:45
  • If performance is the goal, my first question is why use R? Why not C/C++? Second, what input changes least often? The list of patterns? Why not pre-process the list of patterns into ad-hoc C++ code? That kind of code is hard to beat. – Mike Dunlavey Jul 29 '20 at 13:07

1 Answer


As I mentioned in the comments, str_detect(text, my_string) is the bottleneck in your code. Also note that it does not do exactly what you are expecting: it performs a regex search, so every text that merely contains an "a" somewhere is counted as a match as well. See the examples below.
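As a quick illustration of that partial-match behaviour (a small example added here, not part of the original timing): single-letter entries such as "a" in stringr's built-in words vector match inside longer words, which is why "text palabras" slips through below.

"a" %in% stringr::words                   # TRUE, "a" is one of the patterns
stringr::str_detect("text palabras", "a") # TRUE, it matches inside "palabras"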

library(data.table)
library(dtplyr)
library(stringr)
library(dplyr)


my_dt <- data.table(id = 1:300000,
                    text = rep(c("this is some text", "this is some more text", 
                             "text palabras"), 100000)) #imagine many more strings
my_string <- paste(stringr::words, collapse = "|")

# start timing (note: system.time() also works but doesn't print the result of the expression)
timing <- Sys.time()

# run the code
lazy_dt(my_dt, immutable = FALSE) %>%
  filter(str_detect(text, my_string)) %>%
  as.data.table()

            id                   text
     1:      1      this is some text
     2:      2 this is some more text
     3:      3          text palabras
     4:      4      this is some text
     5:      5 this is some more text
    ---                              
299996: 299996 this is some more text
299997: 299997          text palabras
299998: 299998      this is some text
299999: 299999 this is some more text
300000: 300000          text palabras

Sys.time() - timing
Time difference of 6.708245 secs

Note: the equivalent pure data.table code for your pipeline above is the following:

my_dt[str_detect(text, my_string), ]

Timing this is about 6.52 seconds, so not much of an improvement.

As you can see from the result above, this selection returns all the sentences, because "palabras" contains an "a" and therefore matches even though it shouldn't. Now, data.table has an operator called %chin%, which is like %in% but for character vectors and a lot faster. To match on whole words we just need to tokenize the lot, which can be done with unnest_tokens from tidytext; this function respects the data.table format. Afterwards I filter the data on the matching words, drop the word column and keep only the unique rows of the data.table, because a sentence can match several words and would otherwise appear multiple times. Even though there are more function calls, this is about 3 times as fast.

library(tidytext)

timing <- Sys.time()
# tokenize: one row per word, keeping the original text column (drop = F)
my_dt <- unnest_tokens(my_dt, word, text, drop = F)
# keep rows whose token is in the word list, de-duplicate sentences, then drop the word column
my_dt <- unique(my_dt[word %chin% words, ], by = c("id", "text"))[, c("id", "text")]


           id                   text
     1:     1      this is some text
     2:     2 this is some more text
     3:     4      this is some text
     4:     5 this is some more text
     5:     7      this is some text
    ---                             
199996: 299993 this is some more text
199997: 299995      this is some text
199998: 299996 this is some more text
199999: 299998      this is some text
200000: 299999 this is some more text

Sys.time() - timing
Time difference of 2.380911 secs

Now, to speed things up a bit more, you can set the number of threads data.table uses. By default (on my system) this is 2, which you can check with getDTthreads(). When I add one thread with setDTthreads(3), the new code returns in about 1.6 secs.
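For reference, checking and changing the thread count looks like this (pick a value that suits your own CPU):

getDTthreads()   # how many threads data.table is currently allowed to use
setDTthreads(3)  # allow one more thread than my default of 2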

Now maybe someone can speed this up a bit more, by doing this in the .SD part of data.table.
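One possible direction (a rough, unbenchmarked sketch, not part of the original answer) is to do the tokenizing inside data.table itself with strsplit, so the tidytext dependency is no longer needed; this assumes my_dt still has the original id/text columns from before unnest_tokens:

# split each text into lowercase tokens, one row per token
tokens <- my_dt[, .(word = unlist(strsplit(tolower(text), "\\s+"))), by = .(id, text)]
# keep sentences with at least one matching token, then drop the word column
result <- unique(tokens[word %chin% stringr::words], by = c("id", "text"))[, .(id, text)]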

phiver