
I want to look words up in a dictionary using R. I used grepl to search for a word, x, in the dictionary dutch. In the end I wrote a function that returns the proportion of words in a sentence that come from this dictionary. My script is as follows:

tw_copy$token <- tokenize_tweets(tw_copy$text)

count = function(x){
  de = c()
  other = c()
  for (token in x){
    if (grepl(token, dutch) == TRUE){
      de <- c(de, token)
    } else {
      other <- c(other, token)
    }
  }
  return(c(length(de)/length(x), length(other)/length(x)))
}

result <- lapply(tw_copy$token, FUN = count)

tw_copy$de =  lapply(result, "[[", 1)

The output is correct now, but it is really slow and cannot produce output for a bigger dataset.

Can anyone suggest another way to write this that performs faster?

(image: word dictionary)

(image: dataset)

Shantanu Nath
  • can you share the first 5 or 10 items in `tw_copy$token` and `dutch`? – langtang Feb 12 '22 at 19:29
  • I made an edit to my question. Please take a look again. The first one is for *dutch* and the second one is for tw_copy. @langtang – Shantanu Nath Feb 12 '22 at 19:42
  • See the %in% function. Listofwords %in% dictionary will return a logical vector the same length as the original list, showing membership in the dictionary (a small sketch of this appears right after these comments). – Dave2e Feb 12 '22 at 19:56
  • Optimization problems usually depend on your input size, the size of queries at one time, where words are found (looks like you are looking for verbs, which usually do not occur as the first word in Dutch) et cetera. Some optimizations to consider. 0. Stem the words. **1:** reshape the dictionary into tree-like structure, then use a `sort` followed by `switch`. **2**: Using [environments](https://www.r-bloggers.com/2019/01/hash-me-if-you-can/) for lookup. **3:** *memoise* lookup for words like *ik, jij, de, het, is*. **4:** use `stringi::stri_detect_fixed`. – Donald Seinen Feb 13 '22 at 07:28
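
A minimal sketch of the %in% membership check suggested in the comments, with made-up example values:

# toy dictionary and one tokenized tweet (illustrative values only)
dutch  <- c("ik", "jij", "de", "het", "is")
tokens <- c("wil", "jij", "de", "kinderen")

hits <- tokens %in% dutch     # logical vector, one entry per token
sum(hits)                     # number of tokens found in the dictionary
#> [1] 2
sum(hits) / length(tokens)    # proportion, as in the original count() function
#> [1] 0.5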

2 Answers


I would do something like this:

base R

tw_copy$num_dutch_words = lapply(tw_copy$token, \(x) sum(x %in% dutch))

Input

dutch = c("this", "tweet", "one")
tw_copy = data.frame(
  author=c("a","b","c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy$token = lapply(tw_copy$text, \(x) strsplit(x, " ")[[1]])
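
A small variation on the above, assuming a plain numeric column is preferred over a list column:

# vapply returns an integer vector, so num_dutch_words becomes a regular column
tw_copy$num_dutch_words <- vapply(tw_copy$token, \(x) sum(x %in% dutch), integer(1))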

data.table

(this was my original answer; I assumed you might see some speed-up with a very large dataset, but that assumption may be wrong)

tw_copy[, num_dutch_words:=sum(token[[1]] %in% dutch), by=1:nrow(tw_copy)]

Input Data:

dutch = c("this", "tweet", "one")

tw_copy = data.table(
  author=c("a","b","c"),
  text = c("this is the first tweet",
           "this one is the second tweet",
           "and this one is the third tweet")
)
tw_copy[, token:=list(strsplit(text," "))]

In both cases, output like this:

   author                            text                         token num_dutch_words
1:      a         this is the first tweet       this,is,the,first,tweet               2
2:      b    this one is the second tweet  this,one,is,the,second,tweet               3
3:      c and this one is the third tweet and,this,one,is,the,third,...               3
langtang

First, some observations about the approach:

  • grepl and sum are vectorized,
  • the loop grows a vector with c(), which is bad practice (see the short sketch after this list),
  • every word is separated by a space, i.e. the delimiter is fixed.
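
To illustrate the second point: c(de, token) copies the whole vector on every iteration, so a loop that writes into a pre-allocated vector (or a vectorized call) scales much better. A minimal sketch with made-up values:

# toy data for illustration only
toy_dict   <- c("ik", "jij", "de", "het", "is")
toy_tokens <- c("wil", "jij", "de", "kinderen")

# growing a vector: c() reallocates and copies on every iteration
de <- c()
for (token in toy_tokens) {
  if (token %in% toy_dict) de <- c(de, token)
}

# pre-allocated alternative: write into a vector of known length
is_dutch <- logical(length(toy_tokens))
for (i in seq_along(toy_tokens)) {
  is_dutch[i] <- toy_tokens[i] %in% toy_dict
}
sum(is_dutch) / length(toy_tokens)
#> [1] 0.5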

Making a sample dictionary, roughly 11k Dutch words:

library(rvest)
library(stringi)
l <- list(
  c("Natuurlijk we kunnen niet anders"),
  c("wil jij honderden kinderen de"),
  c("van alle geestelijke leiders is")
)

dutch <- read_html("https://cooljugator.com/nl/list/all") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  stri_extract_all_words(simplify = TRUE) %>%
  .[,2] %>%
  stri_remove_empty() %>%
  .[7:length(.)]

lapply gets slower as the list grows; use vapply, or a loop that writes into a correctly initialized vector, instead. Further, base R's %in% can be sped up, as is done in the fastmatch package.

library(fastmatch)
f <- function(data, dictionary) {
  vapply(data, \(x){
    sum(fmatch(
        strsplit(x, " ", fixed = TRUE)[[1]],
        dictionary, nomatch = 0L) > 0L
    )
  }, 1L)
}
f(l, dutch)
#> [1] 1 2 0

With these optimizations in place, a quick benchmark:

library(data.table)
tw_copy = data.table(
  author=c("a","b","c"),
  text = c("Natuurlijk we kunnen niet anders",
           "wil jij honderden kinderen de",
           "van alle geestelijke leiders is")
)

bench::mark(
  x = f(l, dutch),
  y = {
    tw_copy[, token:=list(strsplit(text," "))]
    tw_copy[, num_dutch_words:=sum(token[[1]] %in% dutch), by=1:nrow(tw_copy)]
  }, check = F
)[c(1,3,5,7,9)]

  expression median mem_alloc n_itr total_time
  <bch:expr>  <dbl> <bch:byt> <int>      <dbl>
1 x          0.0141        0B  9999       168.
2 y          3.65       706KB   127       459.

Note that the outputs are not identical, so the benchmark is only an indication; the actual result will depend on your input and desired output. Other optimizations could be applied to the data structure, for example removing stopwords before looking words up in a dictionary that contains only verbs (a sketch follows below).
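
A minimal sketch of the stopword idea, reusing l, dutch, and fmatch from above; the Dutch stopword list from the stopwords package is an assumption, any stopword vector would do:

library(stopwords)

nl_stop <- stopwords("nl")  # common Dutch function words ("de", "het", "is", ...)

f2 <- function(data, dictionary, stop_words) {
  vapply(data, \(x) {
    tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
    tokens <- tokens[!tokens %in% stop_words]   # drop stopwords before the lookup
    sum(fmatch(tokens, dictionary, nomatch = 0L) > 0L)
  }, 1L)
}
f2(l, dutch, nl_stop)

This only makes sense when the stopword list and the dictionary are disjoint, otherwise the counts change.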

For optimization it is useful to go through a checklist.

  • Is there a vectorized approach?
  • Is my data structure suitable for the task?
  • Is my loop properly initializing vectors?

If the speed is still slow, other approaches can be considered.

Donald Seinen