I am facing an encoding issue while working with Russian text. I have a character vector of Russian text (sample shown below), and I am applying synonyms by replacing each group of similar keywords with a single keyword.
library(stringr)  # str_replace_all()
library(stringi)  # stri_encode()
mydata <-
c("Проведение глюкозотолерантного теста",
"Проведение гониоскопической компрессионной пробы Форбса",
"Проведение комплексной медико-автотехнической экспертизы в отношении трупа и живых лиц",
"Проведение комплексного аутопсийного исследования плода и новорожденного")
syns <- list(
list(term="проба", syns= c("проба","пробы")),
list(term="толстокишеч", syns= c("толстокишеч","толст")))
regex_batch <- sapply(syns, function(syn) syn$term)
names(regex_batch) <- sapply(syns, function(x) paste0("\\b", paste(x$syns, collapse = "\\b|\\b"), "\\b"))
text_res <- str_replace_all(mydata, regex_batch)
text_res = stri_encode(text_res,"UTF-8", "")
View(text_res)
The code works fine: it replaces пробы with проба.
Проведение гониоскопической компрессионной **проба** Форбса
But if, instead of creating the mydata vector by hand, I import the data from an XLSX file into R, the code does not work. It does not return any error, but it doesn't replace the synonyms. To investigate the issue, I checked the encoding.
Encoding (if I use the character vector): "unknown".
Encoding (if I import the data from XLSX): "UTF-8".
I also used stri_encode(k,"UTF-8","") to remove the UTF-8 encoding (that is, to convert the text to the native encoding), but it didn't work. I am using the read_excel() function from the readxl package to import the data from the XLSX file. Since I have many rows of text, I need to import the data from Excel or CSV.
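
The encoding check described above can be reproduced with a minimal sketch like the one below. It is only an illustration: the values in k are a hypothetical stand-in for the imported column (k is the name used in the stri_encode() call above), written with \u escapes so that R marks the Cyrillic string as UTF-8 the way the imported data is marked. It compares base R's Encoding() with stringi's stri_enc_mark() and stri_enc_isutf8(), and prints the raw bytes so that a typed string and an imported string can be compared byte by byte.

```r
library(stringi)

# Hypothetical stand-in for the imported column; the \u escapes make R
# mark the first element as UTF-8, similar to data read from XLSX.
k <- c("\u043f\u0440\u043e\u0431\u044b", "test")

enc_base    <- Encoding(k)         # declared encoding per element (base R)
enc_stri    <- stri_enc_mark(k)    # stringi's view of the same marks
valid_utf8  <- stri_enc_isutf8(k)  # TRUE where the bytes are valid UTF-8
bytes_first <- charToRaw(k[1])     # raw bytes of the first element

print(data.frame(enc_base, enc_stri, valid_utf8))
print(bytes_first)
```

Running the same calls on both the hand-typed vector and the imported one should show whether they differ only in the declared encoding mark or also in the underlying bytes.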