I am facing an encoding issue while working with Russian text. I have a list of Russian strings (sample shown below), and I am applying synonyms by replacing similar keywords with a single keyword.

library(stringr)  # str_replace_all()
library(stringi)  # stri_encode()

mydata <- c("Проведение глюкозотолерантного теста",
"Проведение гониоскопической компрессионной пробы Форбса",
"Проведение комплексной медико-автотехнической экспертизы в отношении трупа и живых лиц",
"Проведение комплексного аутопсийного исследования плода и новорожденного")

# Each entry maps a set of synonyms to one canonical term
syns <- list(
list(term = "проба", syns = c("проба", "пробы")),
list(term = "толстокишеч", syns = c("толстокишеч", "толст")))

# Named vector: names are the regex patterns, values are the replacement terms
regex_batch <- sapply(syns, function(syn) syn$term)
names(regex_batch) <- sapply(syns, function(x) paste0("\\b", paste(x$syns, collapse = "\\b|\\b"), "\\b"))
text_res <- str_replace_all(mydata, regex_batch)
# Convert the result from UTF-8 to the native encoding
text_res <- stri_encode(text_res, "UTF-8", "")
View(text_res)

The code works fine: it replaces пробы with проба.

Проведение гониоскопической компрессионной **проба** Форбса

But if I import the data from an XLSX file instead of creating the mydata vector, the code does not work. It does not return any error, but it doesn't replace the synonyms. To investigate the issue, I checked the encoding.

Encoding() when I use the character vector: "unknown".
Encoding() when I import the data from XLSX: "UTF-8".
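
The declared encoding alone does not show whether two strings hold the same bytes, so it can help to inspect both. The following is a minimal sketch using base R only; `describe_encoding` is a hypothetical helper, not part of the code above, and the `\u` escapes are an assumption made so that the sample string is UTF-8 regardless of the session locale.

```r
# Hypothetical helper (not from the original post): summarise how R sees
# each string, so a typed-in value and an imported one can be compared.
describe_encoding <- function(x) {
  data.frame(
    declared   = Encoding(x),    # "unknown", "UTF-8", "latin1" or "bytes"
    valid_utf8 = validUTF8(x),   # TRUE if the raw bytes form valid UTF-8
    n_bytes    = nchar(x, type = "bytes", allowNA = TRUE),
    n_chars    = nchar(x, type = "chars", allowNA = TRUE),
    stringsAsFactors = FALSE
  )
}

# "\u043f\u0440\u043e\u0431\u044b" spells "пробы"; \u escapes make R mark
# the string as UTF-8 whatever the locale is.
typed <- "\u043f\u0440\u043e\u0431\u044b"
info <- describe_encoding(typed)
print(info)
```

Running the same helper on a value typed in the script and on a cell returned by read_excel() would show whether the two differ only in the declared mark or also in the underlying bytes.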

I also tried stri_encode(k, "UTF-8", "") to remove the UTF-8 encoding, but it didn't help. I am using the read_excel() function from the readxl package to import the data from the XLSX file. Since I have many rows of text, I need to import the data from Excel or CSV.

john
  • Is the problem the same when you import from CSV? – Dan Nov 21 '17 at 19:27
  • Yes, the same problem while importing from CSV – john Nov 21 '17 at 19:48
  • The problem isn't that you need to remove UTF-8 encoding, the problem is they both should have UTF-8 encoding. What operating system are you using? Windows? Linux? This can vary based on how your computer is set up. If you make sure everything uses UTF-8 you should be safest. It really says "unknown" encoding for a vector with russian characters? That seems odd. A more [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) is probably necessary here. – MrFlick Nov 21 '17 at 20:21
  • @john What encoding should your data be in? I'm assuming you want it in UTF-8? – petergensler Nov 22 '17 at 01:49
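
The suggestion in the comments above (make both sides UTF-8 instead of removing UTF-8) could be sketched roughly as follows. This is an illustration under assumptions, not a confirmed fix: `to_utf8` is a hypothetical helper, `imported` stands in for a column returned by read_excel(), and the strings are written with `\u` escapes so the snippet behaves the same in any locale.

```r
library(stringr)

# Hypothetical helper (not from the original post): force a character
# vector to UTF-8 so patterns and imported text share one encoding.
to_utf8 <- function(x) enc2utf8(as.character(x))

proba <- "\u043f\u0440\u043e\u0431\u0430"   # "проба"
proby <- "\u043f\u0440\u043e\u0431\u044b"   # "пробы"

# Stand-in for imported text: "пробы Форбса"
imported <- paste(proby, "\u0424\u043e\u0440\u0431\u0441\u0430")

syns <- list(list(term = proba, syns = c(proba, proby)))

# Values are the replacement terms, names are the regex patterns;
# both are converted to UTF-8 before matching.
replacements <- to_utf8(sapply(syns, function(s) s$term))
names(replacements) <- to_utf8(sapply(syns, function(s) {
  paste0("\\b(?:", paste(s$syns, collapse = "|"), ")\\b")
}))

result <- str_replace_all(to_utf8(imported), replacements)
print(result)
```

If the replacement still fails after both sides are UTF-8, the imported text may contain different code points than the typed patterns, which the encoding mark alone would not reveal.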

0 Answers