
I have a large body of text in which I want to replace words with their respective synonyms efficiently (for example, replace all occurrences of "automobile" with the synonym "car"), but I struggle to find a proper (efficient) way to do this.

For the later analysis I use the text2vec library and would like to use that library for this task as well (avoiding tm to reduce dependencies).

An (inefficient) way would look like this:

# setup data
text <- c("my automobile is quite nice", "I like my car")

syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)

My brute-force solution is to use a loop that looks for the words and replaces them:

library(stringr)
# works but is probably not the best...
text_res <- text
for (syn in syns) {
  regex <- paste(syn$syns, collapse = "|")
  text_res <- str_replace_all(text_res, pattern = regex, replacement = syn$term)
}
# which gives me what I want
text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car" 

I used to do it with tm using this approach by MrFlick (using tm::content_transformer and tm::tm_map), but I want to reduce the dependencies of the project by replacing tm with the faster text2vec.
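For reference, the tm-based version looked roughly like this (a sketch of that approach, not MrFlick's exact code):

library(tm)

corp <- VCorpus(VectorSource(text))
# content_transformer lets a plain string function run on each document
replace_syns <- content_transformer(function(doc, pattern, replacement) {
  gsub(pattern, replacement, doc)
})
for (syn in syns) {
  corp <- tm_map(corp, replace_syns,
                 pattern = paste(syn$syns, collapse = "|"),
                 replacement = syn$term)
}
sapply(corp, as.character)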

I guess the optimal solution would be to somehow use text2vec's itoken, but I am unsure how. Any ideas?

David

3 Answers


Quite late, but I still want to add my 2 cents. I see 2 solutions:

  1. A small improvement over your str_replace_all. Since it is vectorized internally, you can make all the replacements without a loop. I think it will be faster, but I didn't run any benchmarks.

    regex_batch = sapply(syns, function(syn) paste(syn$syns, collapse = "|"))  
    names(regex_batch) = sapply(syns, function(x) x$term)  
    str_replace_all(text, regex_batch)  
    
  2. Naturally this task is a fit for a hash-table lookup. The fastest implementation I know of is in the fastmatch package. So you can write a custom tokenizer (a usage sketch follows after the list), something like:

    library(fastmatch)
    library(text2vec)  # for word_tokenizer
    library(magrittr)  # for the %>% pipe
    
    syn_1 = c("nice", "like")
    names(syn_1) = rep('happy_emotion', length(syn_1))
    syn_2 = c("automobile")
    names(syn_2) = rep('car', length(syn_2))
    
    syn_replace_table = c(syn_1, syn_2)
    
    custom_tokenizer = function(text) {
      word_tokenizer(text) %>% lapply(function(x) {
        i = fmatch(x, syn_replace_table)
        ind = !is.na(i)
        i = na.omit(i)
        x[ind] = names(syn_replace_table)[i]
        x
      })
    }
    

I would bet that the second solution will work faster, but one would need to run some benchmarks.
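To plug the custom tokenizer into a text2vec pipeline, a rough usage sketch could look like this (assuming the itoken/create_vocabulary API; argument names may differ between text2vec versions):

    custom_tokenizer(text)
    # [[1]]
    # [1] "my"  "car"  "is"  "quite"  "happy_emotion"
    #
    # [[2]]
    # [1] "I"  "happy_emotion"  "my"  "car"

    # use the synonym-replacing tokenizer wherever word_tokenizer would be used
    it <- itoken(text, tokenizer = custom_tokenizer)
    vocab <- create_vocabulary(it)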

Dmitriy Selivanov
  • That looks like a very interesting concept! I did some quick benchmarks on them. While on small synonym-samples, the for-loop is faster, the `fastmatch` approach is a lot faster on larger lists! Also, as I am still working on the project, your 2 cents are very valuable here! – David Jan 26 '17 at 06:46
  • Also note, that `text2vec::word_tokenizer` is quite slow compared to `stringr::str_split(TEXT_HERE, pattern = stringr::boundary("word"))`. The only reason I'm not using `stringr`/`stringi`/`tokenizers` is that I want to keep number of `text2vec` dependencies as small as possible. – Dmitriy Selivanov Jan 26 '17 at 07:38

With base R this should work:

mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement must be of equal length")
  }
  for (v in seq_along(pattern)) {
    x <- gsub(pattern[v], replacement[v], x)
  }
  return(x)
}

mgsub(c("nice","like","automobile"),c(rep("happy_emotion",2),"car"),text)
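For the example text from the question, this should give the same result as the original loop:

# [1] "my car is quite happy_emotion" "I happy_emotion my car"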
count
  • Isn't this very similar to the loop I posted? Just replacing `stringr::str_replace_all` with `gsub`? – David Jan 11 '17 at 09:54
  • It is, but as you wanted to reduce dependencies I figured you wanted a base R solution. Have you checked out [this](http://stackoverflow.com/questions/29273928/faster-approach-than-gsub-in-r) ? – count Jan 11 '17 at 10:00
  • I was aware of the `perl = T` (which indeed adds speed), and it seems to be faster than `stringr`. But still, I wonder if `text2vec` offers a faster alternative (by using the tokens...) – David Jan 11 '17 at 10:08

The first part of the solution by Dmitriy Selivanov requires a small change: in `str_replace_all` the names of the replacement vector are the patterns and its values are the replacements, so the terms and the regexes need to be swapped.

library(stringr)    

text <- c("my automobile is quite nice", "I like my car")

syns <- list(
             list(term = "happy_emotion", syns = c("nice", "like")),
             list(term = "car", syns = c("automobile"))
             )

regex_batch <- sapply(syns, function(syn) syn$term)  
names(regex_batch) <- sapply(syns, function(x) paste(x$syns, collapse = "|"))  
text_res <- str_replace_all(text, regex_batch) 

text_res
[1] "my car is quite happy_emotion" "I happy_emotion my car"  
Sam S.