First, some observations about the approach:
- grepl and sum are vectorized,
- the loop is growing a vector (bad practice),
- every word is separated by a space, i.e. the delimiter is fixed.
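To make the second observation concrete, here is a minimal sketch that contrasts a growing vector with a preallocated one and with a vectorized call. The `words` vector is a made-up toy input, not the data from the question, and `nchar` only stands in for whatever per-element work the loop does.

```r
# Toy input (hypothetical, for illustration only)
words <- c("we", "kunnen", "niet", "anders")

# Growing: every c() call builds a new, longer vector
grown <- c()
for (w in words) {
  grown <- c(grown, nchar(w))
}

# Preallocated: the result vector is created once and filled by index
filled <- integer(length(words))
for (i in seq_along(words)) {
  filled[i] <- nchar(words[i])
}

# Vectorized: nchar() already works on the whole vector at once
vectorized <- nchar(words)
```

All three give the same result; the difference only shows up in run time and memory use as the input gets larger.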
Making a sample dictionary, roughly 11k Dutch words:
library(rvest)
library(stringi)
l <- list(
  c("Natuurlijk we kunnen niet anders"),
  c("wil jij honderden kinderen de"),
  c("van alle geestelijke leiders is")
)
dutch <- read_html("https://cooljugator.com/nl/list/all") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  stri_extract_all_words(simplify = TRUE) %>%
  .[,2] %>%
  stri_remove_empty() %>%
  .[7:length(.)]
lapply slows down as the list grows; use vapply, or a loop that writes into a correctly initialized vector, instead. Further, base R's %in% can be optimized, as is done in the fastmatch package.
library(fastmatch)
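f <- function(data, dictionary) {
  # One integer count per element of data
  vapply(data, \(x){
    # Split on a fixed space, look each word up, count the hits
    sum(fmatch(
      strsplit(x, " ", fixed = TRUE)[[1]],
      dictionary, nomatch = 0L) > 0L
    )
  }, 1L)
}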
f(l, dutch)
#> [1] 1 2 0
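For comparison, the same counting logic can be sketched in base R only, with %in% in place of fmatch. This is not the benchmarked function: `count_matches` and `toy_dictionary` are hypothetical names, and the three-word dictionary is invented so the example runs without scraping anything.

```r
# Toy data; the dictionary is a made-up stand-in for the scraped word list
toy_sentences <- list(
  "Natuurlijk we kunnen niet anders",
  "wil jij honderden kinderen de",
  "van alle geestelijke leiders is"
)
toy_dictionary <- c("kunnen", "wil", "kinderen")

# Hypothetical helper: number of dictionary words per sentence
count_matches <- function(data, dictionary) {
  vapply(data, function(x) {
    tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
    # %in% returns a logical vector; sum() counts the TRUE values
    sum(tokens %in% dictionary)
  }, integer(1))
}

result <- count_matches(toy_sentences, toy_dictionary)
```

Swapping %in% for the fastmatch lookup leaves the structure of the function unchanged, which makes it easy to time both variants on your own data.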
After an optimization, a customary benchmark:
library(data.table)
tw_copy = data.table(
  author = c("a","b","c"),
  text = c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de",
           "van alle geestelijke leiders is")
)
bench::mark(
  x = f(l, dutch),
  y = {
    tw_copy[, token:=list(strsplit(text," "))]
    tw_copy[, num_dutch_words:=sum(token[[1]] %in% dutch), by=1:nrow(tw_copy)]
  }, check = FALSE
)[c(1,3,5,7,9)]
#>   expression median mem_alloc n_itr total_time
#>   <bch:expr>  <dbl> <bch:byt> <int>      <dbl>
#> 1 x          0.0141        0B  9999       168.
#> 2 y          3.65       706KB   127       459.
Note that the two expressions do not return the same output, so the benchmark is only an indication; actual results will depend on your input and desired output. Other optimizations could be made to the data structure, for example removing stopwords before looping through a dictionary that contains verbs.
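As a rough sketch of the stopword idea, the tokens can be filtered once before any dictionary lookup. The `stopwords_nl` vector below is a hypothetical hand-picked list, not a real stopword resource, and `texts` is toy input.

```r
# Hypothetical stopword list and toy input
stopwords_nl <- c("de", "van", "is", "we", "niet")
texts <- c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de")

# strsplit() is vectorized over texts and returns one token vector per text
tokens <- strsplit(texts, " ", fixed = TRUE)

# Drop stopwords so that later lookups see fewer tokens
tokens_clean <- lapply(tokens, function(tok) tok[!tok %in% stopwords_nl])

# Number of tokens left per text
remaining <- lengths(tokens_clean)
```

Whether this pays off depends on how many tokens are stopwords and on how expensive the later lookup is, so it is worth benchmarking on real input.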
For optimization it is useful to go through a checklist.
- Is there a vectorized approach?
- Is my data structure suitable for the task?
- Is my loop properly initializing vectors?
If it is still too slow after that, other approaches can be considered.
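As an illustration of the first checklist item, the per-sentence loop can be avoided altogether: split every sentence in one call, match all tokens at once, and count hits per sentence afterwards. This is a sketch under assumptions, with a made-up `toy_dictionary` and toy `texts`; it has not been benchmarked against the functions above.

```r
# Toy input and a hypothetical three-word dictionary
texts <- c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de",
           "van alle geestelijke leiders is")
toy_dictionary <- c("kunnen", "wil", "kinderen")

# One strsplit() call for all sentences
tokens <- strsplit(texts, " ", fixed = TRUE)

# For every token, remember which sentence it came from
sentence_id <- rep(seq_along(tokens), lengths(tokens))

# One lookup for all tokens together
is_hit <- unlist(tokens) %in% toy_dictionary

# Count hits per sentence; nbins keeps sentences with zero hits
counts <- tabulate(sentence_id[is_hit], nbins = length(tokens))
```

The design choice here is to trade the many small lookups for one large one, which tends to help when there are many short texts.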