I have a large text body in which I want to efficiently replace words with their respective synonyms (for example, replace all occurrences of "automobile" with the synonym "car"), but I am struggling to find a proper (efficient) way to do this.
For the later analysis I use the text2vec library and would like to use it for this task as well (avoiding tm to reduce dependencies).
An (inefficient) way would look like this:
# setup data
text <- c("my automobile is quite nice", "I like my car")
syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)
My brute-force solution is to loop over the synonym lists and replace each set of synonyms with its term:
library(stringr)

# works, but is probably not the best...
text_res <- text
for (syn in syns) {
  # word boundaries avoid partial matches (e.g. "like" inside "likely")
  regex <- paste0("\\b(", paste(syn$syns, collapse = "|"), ")\\b")
  text_res <- str_replace_all(text_res, pattern = regex, replacement = syn$term)
}
# which gives me what I want
text_res
# [1] "my car is quite happy_emotion" "I happy_emotion my car"
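One dependency-free alternative I considered (a sketch, not necessarily the fastest option) is to flatten `syns` into a named lookup vector once and replace tokens via `match()` instead of running one regex per synonym group:

```r
# flatten the synonym lists into one named vector:
# names are the synonyms, values are the canonical terms
syns <- list(
  list(term = "happy_emotion", syns = c("nice", "like")),
  list(term = "car", syns = c("automobile"))
)
lookup <- unlist(lapply(syns, function(s) {
  setNames(rep(s$term, length(s$syns)), s$syns)
}))

# tokenize on spaces, swap matched tokens, and re-join
replace_syns <- function(x, lookup) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    hit <- match(tokens, names(lookup))
    tokens[!is.na(hit)] <- lookup[hit[!is.na(hit)]]
    paste(tokens, collapse = " ")
  }, character(1))
}

replace_syns(c("my automobile is quite nice", "I like my car"), lookup)
# [1] "my car is quite happy_emotion" "I happy_emotion my car"
```

This only does exact whole-token matches, so it sidesteps the regex substring problem, but it also needs a smarter tokenizer than `strsplit` once punctuation is involved.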
I used to do this with tm, using this approach by MrFlick (with tm::content_transformer and tm::tm_map), but I want to reduce the dependencies of the project by replacing tm with the faster text2vec.
I guess the optimal solution would be to somehow use text2vec's itoken, but I am unsure how. Any ideas?