
I have a large body of text in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I struggle to find a proper (efficient) way to do this.

For the later analysis I use the text2vec library and would like to use that library for this task as well (avoiding tm to reduce dependencies).

An (inefficient) way would look like this:

# setup data
text <- c("my automobile is quite nice", "I like my car")

syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)

My brute-force solution is to use a loop that looks for the words and replaces them:

library(stringr)
# works but is probably not the best...
text_res <- text
for (syn in syns) {
  regex <- paste(syn$syns, collapse = "|")
  text_res <- str_replace_all(text_res, pattern = regex, replacement = syn$term)
}
# which gives me what I want
text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car" 

I used to do it with tm using this approach by MrFlick (using tm::content_transformer and tm::tm_map), but I want to reduce the dependencies of the project by replacing tm with the faster text2vec.
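For reference, the tm-based version looked roughly like this (a sketch of that approach, not MrFlick's exact code):

library(tm)

corp <- VCorpus(VectorSource(text))
# content_transformer lets a plain string function run on each document
replace_syns <- content_transformer(function(doc, pattern, replacement) {
  gsub(pattern, replacement, doc)
})
for (syn in syns) {
  corp <- tm_map(corp, replace_syns,
                 pattern = paste(syn$syns, collapse = "|"),
                 replacement = syn$term)
}
sapply(corp, as.character)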

I guess the optimal solution would be to somehow use text2vec's itoken, but I am unsure how. Any ideas?

David

3 Answers


Quite late, but I still want to add my 2 cents. I see 2 solutions:

  1. A small improvement over your str_replace_all. Since it is vectorized internally, you can make all the replacements without a loop. I think it will be faster, but I didn't run any benchmarks.

    regex_batch = sapply(syns, function(syn) paste(syn$syns, collapse = "|"))  
    names(regex_batch) = sapply(syns, function(x) x$term)  
    str_replace_all(text, regex_batch)  
    
  2. Naturally this task is a fit for a hash-table lookup. The fastest implementation I know of is in the fastmatch package. So you can write a custom tokenizer (a usage sketch follows after the list), something like:

    library(fastmatch)
    library(text2vec)  # for word_tokenizer
    library(magrittr)  # for the %>% pipe
    
    syn_1 = c("nice", "like")
    names(syn_1) = rep('happy_emotion', length(syn_1))
    syn_2 = c("automobile")
    names(syn_2) = rep('car', length(syn_2))
    
    syn_replace_table = c(syn_1, syn_2)
    
    custom_tokenizer = function(text) {
      word_tokenizer(text) %>% lapply(function(x) {
        i = fmatch(x, syn_replace_table)
        ind = !is.na(i)
        i = na.omit(i)
        x[ind] = names(syn_replace_table)[i]
        x
      })
    }
    

I would bet that the second solution will work faster, but one would need to run some benchmarks.
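To plug the custom tokenizer into a text2vec pipeline, a rough usage sketch could look like this (assuming the itoken/create_vocabulary API; argument names may differ between text2vec versions):

    custom_tokenizer(text)
    # [[1]]
    # [1] "my"  "car"  "is"  "quite"  "happy_emotion"
    #
    # [[2]]
    # [1] "I"  "happy_emotion"  "my"  "car"

    # use the synonym-replacing tokenizer wherever word_tokenizer would be used
    it <- itoken(text, tokenizer = custom_tokenizer)
    vocab <- create_vocabulary(it)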

Dmitriy Selivanov
  • That looks like a very interesting concept! I did some quick benchmarks on them. While on small synonym-samples, the for-loop is faster, the `fastmatch` approach is a lot faster on larger lists! Also, as I am still working on the project, your 2 cents are very valuable here! – David Jan 26 '17 at 06:46
  • Also note, that `text2vec::word_tokenizer` is quite slow compared to `stringr::str_split(TEXT_HERE, pattern = stringr::boundary("word"))`. The only reason I'm not using `stringr`/`stringi`/`tokenizers` is that I want to keep number of `text2vec` dependencies as small as possible. – Dmitriy Selivanov Jan 26 '17 at 07:38

With base R this should work:

mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement must be of equal length")
  }
  for (v in seq_along(pattern)) {
    x <- gsub(pattern[v], replacement[v], x)
  }
  return(x)
}

mgsub(c("nice","like","automobile"),c(rep("happy_emotion",2),"car"),text)
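For the example text from the question, this should give the same result as the original loop:

# [1] "my car is quite happy_emotion" "I happy_emotion my car"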
count
  • Isn't this very similar to the loop I posted? Just replacing `stringr::str_replace_all` with `gsub`? – David Jan 11 '17 at 09:54
  • It is, but as you wanted to reduce dependencies I figured you wanted a base R solution. Have you checked out [this](http://stackoverflow.com/questions/29273928/faster-approach-than-gsub-in-r) ? – count Jan 11 '17 at 10:00
  • I was aware of the `perl = T` (which indeed adds speed), and it seems to be faster than `stringr`. But still, I wonder if `text2vec` offers a faster alternative (by using the tokens...) – David Jan 11 '17 at 10:08

The first part of the solution by Dmitriy Selivanov requires a small change: in `str_replace_all` the names of the replacement vector are the patterns and its values are the replacements, so the terms and the regexes need to be swapped.

library(stringr)    

text <- c("my automobile is quite nice", "I like my car")

syns <- list(
             list(term = "happy_emotion", syns = c("nice", "like")),
             list(term = "car", syns = c("automobile"))
             )

regex_batch <- sapply(syns, function(syn) syn$term)  
names(regex_batch) <- sapply(syns, function(x) paste(x$syns, collapse = "|"))  
text_res <- str_replace_all(text, regex_batch) 

text_res
[1] "my car is quite happy_emotion" "I happy_emotion my car"  
Sam S.